AFIT/GAM/ENC/08-01


RISK-BASED  COMPARISON  OF  CLASSIFICATION  SYSTEMS 

THESIS 

Seth  B.  Wagenman,  Second  Lieutenant,  USAF 

DEPARTMENT  OF  THE  AIR  FORCE 
AIR  UNIVERSITY 

AIR  FORCE  INSTITUTE  OF  TECHNOLOGY 

Wright-Patterson  Air  Force  Base,  Ohio 


APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED. 


The  views  expressed  in  this  thesis  are  those  of  the  author  and  do  not  reflect  the  official 
policy  or  position  of  the  United  States  Air  Force,  Department  of  Defense,  or  the  United 
States  Government. 


Presented  to  the  Faculty 
Department  of  Mathematics  and  Statistics 
Graduate  School  of  Engineering  and  Management 
Air  Force  Institute  of  Technology 
Air  University 

Air  Education  and  Training  Command 
In  Partial  Fulfillment  of  the  Requirements  for  the 
Degree  of  Master  of  Science 


Seth  B.  Wagenman,  B.S. 
Second  Lieutenant,  USAF 


March  2008 



Approved: 


Steven N. Thorsen (Chair)

Mark E. Oxley (Member)

David M. Kaziska (Member)




Abstract 

Performance measures for families of classification systems that rely upon the
analysis of receiver operating characteristics (ROCs), such as area under the ROC curve
(AUC), often fail to fully address the issue of risk, especially for classification systems
involving more than two classes. For the general case, we denote matrices of class
prevalences, costs, and class-conditional probabilities, and assume that costs are subjectively
fixed, that acceptable estimates for expected values of class-conditional probabilities exist, and
that any variable in one such matrix is independent of the variables in every other matrix.
The ROC Risk Functional (RRF), valid for any finite number of classes, has an associated
parameter argument that specifies the member of a family of classification systems
which minimizes Bayes risk over the family. We typify joint distributions for
class prevalences over standard simplices by means of uniform and beta distributions, and
create a family of classification systems using actual data, testing independence
assumptions under two such class prevalence distributions. We minimize risk under two
different sets of costs.


I.  Introduction  and  Mathematical  Foundations 

The  concept  of  risk  is  a  major  feature  of  Bayesian  decision  theory  [5,  pp.  24-28],  [18,  p. 
437].  Its  power  is  evident  in  that  it  takes  into  account  not  only  the  relative  severity  of 
expected  conditional  losses  for  each  possible  decision,  but  also  the  likelihood  of  events  upon 
which  the  occurrence  of  each  loss  is  conditioned.  It  allows  definition  of  these  quantities 
through  the  use  of  either  discrete  or  continuous  random  variables,  or  a  combination  of 
both.  In  this  way,  it  accounts  for  many  characteristics  of  the  operating  environment. 

The  term  Receiver  Operating  Characteristic  (ROC)  seems  to  refer  directly  to  this 
type  of  decision-theoretical  framework,  yet  practical  applications  of  decision  theory  in 
which  this  term  appears  often  ignore  critical  aspects  of  Bayesian  theory.  To  show  this,  a 
brief  introduction  to  ROC  analysis  is  necessary,  as  is  a  precise  set  of  mathematical 
definitions,  to  establish  a  framework  for  possible  correction  of  these  oversights. 

1.1  Introduction

The  field  of  Receiver  Operating  Characteristic  analysis  emerged  in  the  1940s,  during 
early  attempts  to  discern  between  the  presence  or  absence  of  signals  amidst  noise  [6,  pp. 
1-2].  Since  there  are  only  two  possible  outcomes,  such  a  signal  detection  process  is  an 
example  of  two-class  or  binary  classification. 

In  signal  detection,  there  are  two  possible  classification  errors — falsely  perceiving  the 
presence  of  signals  amidst  noise  when  there  is  only  noise,  and  failing  to  detect  a  signal  in 
the midst of noise. One representation of the ROC for a binary classification system is
simply  a  vector  of  the  likelihoods  of  these  errors.  If  the  method  of  classification  changes,  so 
may  these  likelihoods,  thereby  generating  a  different  ROC  vector.  A  collection  of  estimates 
for  such  ROC  vectors  plotted  on  a  unit  square  may  offer  limited  visual  insight  into 
comparison  of  the  classification  methods  whose  characteristic  behavior  they  represent,  and 
even  more  when  other  factors  of  interest  are  plotted  on  a  third  axis  [10].  Many  authors 
have  developed  advanced  geometrical  frameworks  relating  to  the  points  so  plotted,  due  to 
the  common  practice  of  calculating  areas  under  a  curve  constructed  of  these  plotted 
points  [8],  [10],  [12],  [13],  [14],  [21],  [22],  [26],  [34],  [36].  The  use  of  ROCs  in  such 
comparative  decision-making  is  referred  to  as  ROC  analysis. 

Even though ROC analysis is used in many fields to compare binary classification
techniques,  it  appears  that  very  few  of  its  proponents  have  fully  realized  the  importance 
and  potential  of  risk-based  comparison  as  a  tool  for  comparing  classification  techniques, 
especially  those  in  which  there  are  more  than  two  distinct  classes  [30,  pp.  57-64],  [31,  p. 
352], [32, p. 4]. Although practical risk-based comparison of classification systems requires
what  could  be  considered  unrealistic  independence  assumptions  to  enable  the  risk 
calculations,  the  possible  insights  gained  when  these  assumptions  are  met  may  at  least 
justify  the  expense  of  testing  them,  and  when  viewed  in  light  of  the  implicit  assumptions 
connected  with  a  failure  to  fully  consider  all  elements  of  the  risk  calculation,  these 
assumptions  may  not  be  harsh  at  all.  Since  the  failure  to  meet  these  assumptions  is  rarely 
mentioned  in  modern  ROC  analysis  literature,  the  reason  for  not  calculating  risk  may 
simply be the lack of a practical and rigorous mathematical framework for its analysis. There
is  recent  work,  however,  which  constitutes  a  foundation  on  which  to  build  a  framework  for 
measuring the performance of classification systems, with Bayes risk as the measure of
optimality, and incorporating some of the independence assumptions mentioned above [32].
The  intent  of  this  thesis  is  neither  to  show  how  these  assumptions  may  be  met,  nor  to 
stipulate  as  to  the  relative  importance  of  actually  meeting  them,  but  instead  to  show  how 
they  may  be  tested,  and  then  to  assume  that  even  if  they  are  not  met,  the  disadvantage  of 
such  failure  is  mitigated  by  the  ability  to  easily  calculate  risk.  The  major  point  of  interest 
is  that  geometric  analyses  are  replaced  by  risk-based  comparisons,  thereby  possibly 
lessening  the  need  to  construct  curves  or  surfaces,  or  to  calculate  geometric  quantities. 

1.2  Definitions  and  Assumptions 

Before  proceeding,  it  is  necessary  to  define  notation  and  terms  relating  to  general 
classification theory and ROC analysis. Examples from the field of signal detection will
illustrate  selected  concepts. 

Definition  1  (Experiment).  An  experiment  is  a  complex  of  reproducible  conditions 
resulting in a set of well-defined outcomes [16, pp. 3-5], [29, p. 32].

For  example,  the  presence  of  electromagnetic  radiation  constitutes  a  possible  signal 
detection  experiment. 

Definition  2  (Elementary  Event).  An  elementary  event  is  an  experimental  outcome  which 
cannot  be  further  decomposed  into  other,  more  basic  experimental  outcomes  [33,  p.  26]. 

An  elementary  event  in  signal  detection  could  be  e  =  a  detectable  instance  of 
electromagnetic  radiation  exists. 

Definition 3 (Event Set). An event set is the set $E = \{e_\alpha\}_{\alpha \in A}$ of possible elementary
events resulting from a given experiment, where the index set $A$ may be
uncountable [16, pp. 3-5], [29, p. 33], [33, p. 27].

Definition 4 (Sensor, Data Set). A sensor is a function $s$ with event set domain $E$, whose action is to
observe elementary events $e$ and gather data about them; therefore, the range of a sensor
is a data set $D$ of data elements $d_e$ corresponding to the elementary events observed [32, p. 1].

In  signal  detection,  a  data  set  could  be  a  hard  disk  containing  information  gathered 
through  a  cable  connected  to  a  radio  signal  detection  machine. 

Definition 5 (Processor, Feature Set). Given a data set $D$, a processor is a function $p$
with data subset domain $D' \subset D$, whose action is to transform data corresponding to
elementary events $e$ observed by a sensor $s: E \to D$ (whose domain is an event set $E$ and
whose range contains the domain of $p$) into a vector of numbers, usually real-valued;
therefore, the range of a processor is a feature set $F$ of finite-dimensional vectors $f_e$
corresponding to elementary events $e$ whose representational data points $d_e$ are elements
of the domain of $p$ [32, p. 1]. An element $f_e \in F$ of a feature set is called an exemplar.

In  signal  detection,  a  processor  could  be  a  computer  that  receives  a  floppy  disk 
containing  some  of  the  data  gathered  by  a  sensor  and  performs  calculations  to  produce  a 
matrix  of  real  numbers,  with  columns  corresponding  to  wave  amplitude  and  frequency 
variables,  and  with  exemplars  as  row  vectors  in  the  matrix  corresponding  to  elementary 
events  observed  by  the  sensor.  Part  of  these  calculations  may  also  create  new  variables  that 
are  related  to,  but  not  defined  strictly  the  same  as,  the  variables  observed  by  the  sensor. 

For example, Principal Components Analysis (PCA) is a method of reducing the number of
features,  by  creating  a  few  linear  combinations  of  them  which  explain  most  of  the  variance 
in  the  original  features  matrix.  The  coefficients  of  each  of  these  linear  combinations  are 
applied to each row of the original feature data to produce a principal component score,
which  in  turn  becomes  a  new  feature  variable. 

Definition 6 (Event). Any subset $A \subset E$ of an event set is called an event [29, p. 34].
Note that any set $E$ is always regarded as a subset $E \subset E$ of itself, and the empty set $\emptyset$
is a subset of every set, itself included, even though we may not explicitly denote its presence.
Also, for $A \subset E$, if any elementary event $e \in A$ occurs, then $A$ has also occurred.

In signal detection, the sets $\mathcal{E}_1$ = {radiation with signals amidst noise is present}
and $\mathcal{E}_2$ = {only noise is present} are both subsets of the event set $E = \mathcal{E}_1 \cup \mathcal{E}_2$, as is the set
$E$ itself; therefore, each set listed above constitutes an event.

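Definition 7 (Finite Set Partition). Given a non-empty set $E$ and a finite index set $\Lambda$, a
collection of subsets $\{\mathcal{E}_\lambda \subset E\}_{\lambda \in \Lambda}$ is a finite set partition of $E$ when the following hold:

(i) $\mathcal{E}_\lambda \cap \mathcal{E}_\mu = \emptyset$, $\forall\, \mu, \lambda \in \Lambda \ni \mu \neq \lambda$, and

(ii) $\bigcup_{\lambda \in \Lambda} \mathcal{E}_\lambda = E$;

i.e., $\{\mathcal{E}_\lambda \subset E\}_{\lambda \in \Lambda}$ is a finite collection of mutually exclusive subsets of $E$ whose union is
the whole set $E$ [29, p. 36].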
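
The two conditions of a finite set partition can be checked mechanically. The following is a minimal sketch, assuming events are modeled as Python sets of hashable elementary events; the function name `is_finite_partition` and the four-element event set are hypothetical and serve only as illustration.

```python
from itertools import combinations

def is_finite_partition(event_set, subsets):
    """Check conditions (i) and (ii) of a finite set partition."""
    subsets = [set(s) for s in subsets]
    # (i) every pair of distinct subsets must be disjoint
    disjoint = all(a.isdisjoint(b) for a, b in combinations(subsets, 2))
    # (ii) the union of all subsets must equal the whole set
    covers = set().union(*subsets) == set(event_set)
    return disjoint and covers

# Hypothetical signal-detection event set with four elementary events
E = {"s1", "s2", "n1", "n2"}
signal, noise = {"s1", "s2"}, {"n1", "n2"}

print(is_finite_partition(E, [signal, noise]))                # a valid partition
print(is_finite_partition(E, [signal, {"s2", "n1", "n2"}]))   # overlapping subsets
```

The second call fails condition (i) because the two subsets share the elementary event "s2".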

Definition 8 (Classifier, Label Set). A classifier is a function $c$ with feature subset
domain $F' \subset F$, whose action assigns exactly one label $\ell$ out of a finite set $L$ of distinct
labels to each feature vector $f \in F'$; therefore, a label set $L = \{\ell_1, \ell_2, \ldots, \ell_n\}$ is the
range of an $n$-class classifier $c: F' \to L$, such that if an event set $E$ is the domain of a
sensor $s: E \to D$ whose range contains the domain $D' \subset D$ of a processor $p: D' \to F$,
whose range in turn contains the domain $F' \subset F$ of $c$, then $L$ partitions $E$ into a set of $n$
mutually exclusive subsets $\{\mathcal{E}_j\}_{j=1}^{n}$ called classes, whose union is the entire event set $E$,
such that each class $\mathcal{E}_j \subset E$ corresponds to exactly one label $\ell_j \in L$ [32, pp. 1-2].

A  signal  detection  classifier  could  be  an  artificial  neural  network  operating  on  rows 
(exemplars)  extracted  from  a  principal  component  score  matrix  whose  row  vectors 
correspond  to  particular  instances  of  electromagnetic  radiation.  Note  that  the  method  of 
creating  or  training  such  a  classifier,  as  well  as  testing  it  against  a  subset  of  the  data  from 
which  it  is  created,  is  subjective;  for  example,  a  binary  classifier  could  be  flipping  a  fair  coin. 

The signal detection label set $L = \{\ell_1, \ell_2\}$ (where elementary events in class $\mathcal{E}_1$ =
{radiation with signals amidst noise is present} correspond to the label $\ell_1$, and elementary
events in class $\mathcal{E}_2$ = {only noise is present} correspond to the label $\ell_2$) induces the finite
set partition $E = \mathcal{E}_1 \cup \mathcal{E}_2$, where the event set is $E$ = {electromagnetic radiation is
present}. This is an example of a two-class partition.

Definition 9 (Classification System). Given the following:

(i) a sensor $s: E \to D$ with event set domain $E$ and data set range $D$,

(ii) a processor $p: D' \to F$ with data subset domain $D' \subset D$ and feature set range $F$,
and

(iii) a classifier $c: F' \to L$, with feature subset domain $F' \subset F$ and label set range $L$,

the composition $\mathcal{A} = c \circ p \circ s: E \to L$ with event set domain $E$ and label set range $L$
is a classification system $\mathcal{A}: E \to L$ [32, p. 2].
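
The composition of sensor, processor, and classifier can be sketched directly as function composition. This is an illustrative sketch only: the three stage functions, the dictionary representation of an elementary event, and the 0.5 amplitude cut-off are all hypothetical stand-ins, not the systems studied in this thesis.

```python
def compose(*functions):
    """Return the composition f1 o f2 o ... o fn (rightmost applied first)."""
    def composed(x):
        for f in reversed(functions):
            x = f(x)
        return x
    return composed

# All three stages below are hypothetical stand-ins.
def sensor(event):        # s: E -> D, records raw measurements of an event
    return {"amplitude": event["amplitude"], "frequency": event["frequency"]}

def processor(datum):     # p: D' -> F, turns a data element into a feature vector
    return (datum["amplitude"], datum["frequency"])

def classifier(feature):  # c: F' -> L, assigns exactly one label
    return "signal" if feature[0] > 0.5 else "noise"

# The classification system is the composition c o p o s.
classification_system = compose(classifier, processor, sensor)

print(classification_system({"amplitude": 0.9, "frequency": 101.5}))
```

Keeping the stages separate makes it easy to swap only the classifier while reusing the same sensor and processor.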

Definition 10 (Threshold Set). Given any feature set $F$, a threshold set $\Theta$ of interest is a
set of parameters $\theta \in \Theta$ that influence mappings with domain $F$. These parameters need
be neither univariate, continuous, nor real-valued.




A signal detection threshold parameter $\theta_1$ that is neither continuous nor real-valued
could be choosing whether to flip a quarter or a nickel, whereas a continuous and
real-valued parameter $\theta_2$ might be the choice of a real-valued discriminating
criterion [5, pp. 48-49]. Some types of artificial neural net classifiers have a continuous
parameter called the spread, such that each setting of this parameter effectively defines a
new classifier, given a particular choice of methods for training. A threshold set $\Theta$ of
interest might also be the Cartesian product:

$$\Theta = \{(\theta_1, \theta_2) : \theta_1 \in \Theta_1,\ \theta_2 \in \Theta_2\} = \Theta_1 \times \Theta_2$$

of threshold sets $\Theta_1$ and $\Theta_2$ [32, pp. 1-2]. It should be noted that in practice, we only
consider a finite number of threshold parameters to approximate a continuous threshold
set, and so each distinct finite sample may be considered a separate discrete threshold set.

Definition 11 (Family of Classification Systems Over a Threshold Set). Given a threshold
set $\Theta$ such that the value of the parameter $\theta \in \Theta$ determines the action of a classifier $c_\theta$,
a family of classification systems of the form $\mathcal{A}_\theta = c_\theta \circ p \circ s$ over the threshold set $\Theta$ is
the collection $\mathcal{A}_\Theta = \{\mathcal{A}_\theta : \theta \in \Theta\}$ of all such classification systems.

It should be noted here that when searching for a classification system $\mathcal{A}_\theta$ to meet
some particular criterion from an infinite family $\mathcal{A}_\Theta$ defined over a continuous threshold
set $\Theta$, practicality requires the creation of finite families of classification systems over
discrete samples from the continuous threshold set. Even though these samples are subsets
of the same set, they may be distinct, and thus the families of classification systems over
these sample threshold sets are also distinct.
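
The distinction between finite families built from different samples of one continuous threshold set can be illustrated with a short sketch. The scalar threshold rule in `make_classifier`, the identity sensor and processor, and the sampled threshold values are hypothetical choices made only for this example.

```python
def make_classifier(theta):
    """Hypothetical classifier c_theta: label 1 if the feature exceeds theta, else 2."""
    return lambda feature: 1 if feature > theta else 2

def make_family(thetas, processor, sensor):
    """Map each sampled theta to the system A_theta = c_theta o p o s."""
    family = {}
    for theta in thetas:
        c_theta = make_classifier(theta)
        # Bind c_theta as a default argument so each system keeps its own classifier.
        family[theta] = lambda e, c=c_theta: c(processor(sensor(e)))
    return family

identity = lambda x: x  # stand-in for both sensor and processor

# Two distinct finite samples from the same continuous threshold set [0, 1]
coarse = make_family([0.25, 0.75], identity, identity)
fine = make_family([0.1, 0.3, 0.5, 0.7, 0.9], identity, identity)

# Label assigned to the same elementary event by every member of the finer family
labels = {theta: system(0.6) for theta, system in fine.items()}
print(labels)
```

The two dictionaries share no thresholds, so a search over `coarse` and a search over `fine` may select different systems even though both approximate the same infinite family.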

Definition 12 ($\sigma$-Field). Given a non-empty set $E$ and a countable index set $\Lambda$, a
collection $\mathcal{S}$ of subsets $A \subset E$ is a $\sigma$-field over $E$ when the following hold true:

(i) $E \in \mathcal{S}$,

(ii) if $A \in \mathcal{S}$, then $A^c \in \mathcal{S}$, and

(iii) $A_\lambda \in \mathcal{S},\ \forall\, \lambda \in \Lambda \implies \bigcup_{\lambda \in \Lambda} A_\lambda \in \mathcal{S}$,

where $A^c \subset E$ is the complement $\{e \in E : e \notin A\}$ in $E$ of the subset $A \subset E$ [16, p. 2],
[27, pp. 17-18]. The $\sigma$-field $\mathcal{S}$ may also be called a $\sigma$-algebra over $E$ [27, p. 18], [32, p. 1].

Definition 13 (Pre-Image Set Function). Given a mapping $m: E \to L$ defined between
any sets $E$ and $L$, the pre-image of a subset $A \subset L$ is a subset $m^\natural(A) \subset E$ of $E$ given by:

$$m^\natural(A) = \{e \in E : m(e) \in A\} \subset E \tag{1}$$

where we use the becuadro ($\natural$) to denote pre-image instead of the usual inverse symbol
($^{-1}$) to avoid misinterpretation [32, pp. 3-4]. The pre-image set function
$m^\natural: \mathcal{P}(L) \to \mathcal{P}(E)$ is well-defined, where $\mathcal{P}(L)$ denotes the power set $\{A : A \subset L\}$ of $L$.

When  a  signal  detection  system  classifies  instances  of  electromagnetic  radiation  as 
either  containing  signals  or  not,  the  subset  {instances  of  electromagnetic  radiation  classified 
as  containing  signals  amidst  noise }  of  the  event  set  is  the  pre-image  of  a  singleton  subset 
{signals  amidst  noise}  of  the  label  set  {signals  amidst  noise,  noise  alone}. 
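
For a finite event set, the pre-image in Equation (1) is a one-line set comprehension. The sketch below assumes a hypothetical event set of four radiation instances identified by a measured amplitude, and a hypothetical system that labels by comparing the amplitude with 0.5.

```python
def preimage(mapping, domain, label_subset):
    """Return {e in domain : mapping(e) in label_subset}, as in Equation (1)."""
    return {e for e in domain if mapping(e) in label_subset}

# Hypothetical event set: instances of radiation identified by measured amplitude
E = {0.1, 0.4, 0.6, 0.9}
system = lambda e: "signals amidst noise" if e > 0.5 else "noise alone"

# Pre-image of the singleton subset {signals amidst noise} of the label set
print(preimage(system, E, {"signals amidst noise"}))
# The pre-image of the empty set is empty, and that of the full label set is E.
print(preimage(system, E, set()))
print(preimage(system, E, {"signals amidst noise", "noise alone"}) == E)
```

Unlike an inverse function, the pre-image is defined for every subset of the label set, whether or not the mapping is one-to-one.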

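Definition 14 (Probability Measure). Given a $\sigma$-field $\mathcal{S}$ over an event set $E$, a mapping
$P: \mathcal{S} \to [0, 1]$ is a probability measure on $\mathcal{S}$, or, in other words, $P$ is said to be
measurable with respect to the $\sigma$-field $\mathcal{S}$, when the following hold true [29, pp. 41-42]:

(i) $P(\mathcal{E})$ is defined for each event $\mathcal{E} \in \mathcal{S}$,

(ii) $P(E) = 1$, and

(iii) given any countable collection $\{\mathcal{E}_\lambda \in \mathcal{S}\}_{\lambda \in \Lambda}$ of events such that
$\mathcal{E}_\lambda \cap \mathcal{E}_\mu = \emptyset$, $\forall\, \mu, \lambda \in \Lambda \ni \mu \neq \lambda$:

$$P\left(\bigcup_{\lambda \in \Lambda} \mathcal{E}_\lambda\right) = \sum_{\lambda \in \Lambda} P(\mathcal{E}_\lambda) \tag{2}$$

Note that a given probability measure $P: \mathcal{S} \to [0, 1]$ may be measurable with respect to
other $\sigma$-fields besides $\mathcal{S}$, and that pre-images under a probability measure $P$ of all subsets
$A \subset [0, 1]$ are measurable sets; i.e., they are events in the $\sigma$-field over which $P$ is defined.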

Definition 15 (Class Prevalence, Prior Probability). Given the following:

(a) a finite index set $\Lambda$ with cardinality $\mathrm{Card}(\Lambda) = n$,

(b) a label set $L$ with cardinality $\mathrm{Card}(L) = n = \mathrm{Card}(\Lambda)$,

(c) an event set $E$ partitioned by $L$ into classes $\{\mathcal{E}_1, \ldots, \mathcal{E}_n\}$ satisfying $\bigcup_{j \in \Lambda} \mathcal{E}_j = E$,

(d) a $\sigma$-field $\mathcal{S}$ over $E$ such that $\{\mathcal{E}_j\}_{j \in \Lambda} \subset \mathcal{S}$, and lastly,

(e) a probability measure $P: \mathcal{S} \to [0, 1]$ defined on $\mathcal{S}$,

the class prevalence $p_j$ for a particular class $\mathcal{E}_j$ is given by $p_j = P(\mathcal{E}_j)$. Note that $p_j$ is
also called the a priori probability (a.k.a. the prior probability) that a given elementary
event $e \in E$ will be contained in class $\mathcal{E}_j$, for some $j \in \Lambda$. Since $\{\mathcal{E}_j\}_{j \in \Lambda}$ is a partition of
$E$ and the probability measure $P$ satisfies $P(E) = 1 = P\left(\bigcup_{j=1}^{n} \mathcal{E}_j\right)$, then by Definition
14 above, we must have $\sum_{j=1}^{n} P(\mathcal{E}_j) = 1 = \sum_{j=1}^{n} p_j$.
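
Theorem 1 (Bayes Theorem). Given a probability measure $P: \mathcal{S} \to [0, 1]$ defined on a
$\sigma$-field $\mathcal{S}$ over an event set $E$, and any two events $\mathcal{X}, \mathcal{Y} \in \mathcal{S}$, the conditional probabilities
$P(\mathcal{X}|\mathcal{Y})$ and $P(\mathcal{Y}|\mathcal{X})$ have the following scalar relationship:

$$P(\mathcal{X}|\mathcal{Y}) = \frac{P(\mathcal{X} \cap \mathcal{Y})}{P(\mathcal{Y})} = \frac{P(\mathcal{Y} \cap \mathcal{X})}{P(\mathcal{Y})} = \frac{P(\mathcal{Y}|\mathcal{X})\,P(\mathcal{X})}{P(\mathcal{Y})} = P(\mathcal{Y}|\mathcal{X})\,\frac{P(\mathcal{X})}{P(\mathcal{Y})} \tag{3}$$

whenever $P(\mathcal{Y}) \neq 0$; however, if $P(\mathcal{Y}) = 0$, then $P(\mathcal{X}|\mathcal{Y}) = 0$, $\forall\, \mathcal{X} \in \mathcal{S}$ [33, p. 68].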
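
A small numerical check may make the scalar relationship in Theorem 1 concrete. The sketch below uses exact rational arithmetic; the three probabilities are hypothetical values chosen only for illustration, and the helper `conditional` is not part of the thesis.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """P(X|Y) = P(X and Y) / P(Y), with the convention P(X|Y) = 0 when P(Y) = 0."""
    return Fraction(0) if p_given == 0 else p_joint / p_given

# Hypothetical probabilities chosen only for illustration
p_x = Fraction(3, 10)   # P(X)
p_y = Fraction(1, 2)    # P(Y)
p_xy = Fraction(1, 5)   # P(X and Y)

p_x_given_y = conditional(p_xy, p_y)
p_y_given_x = conditional(p_xy, p_x)

# Bayes Theorem: P(X|Y) equals P(Y|X) scaled by P(X) / P(Y)
assert p_x_given_y == p_y_given_x * p_x / p_y
print(p_x_given_y, p_y_given_x)
```

Using `Fraction` avoids floating-point rounding, so the equality in the theorem can be asserted exactly.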


Definition 16 (Class-Conditional Probability). Given the following:

(a) a classification system $\mathcal{A}: E \to L$ with event set domain $E$ and label set range $L$,

(b) a finite index set $\Lambda$ satisfying $\mathrm{Card}(L) = n = \mathrm{Card}(\Lambda)$,

(c) a $\sigma$-field $\mathcal{S}$ over $E$ containing at least the following events:

1. all classes in the partition $\bigcup_{j \in \Lambda} \mathcal{E}_j = E$ induced by $L$ on $E$; and

2. all pre-images $\mathcal{A}^\natural(\{\ell_i\}) \subset E$ of singleton label subsets $\{\ell_i\} \subset L$;

(d) a probability measure $P: \mathcal{S} \to [0, 1]$ on $\mathcal{S}$; and

(e) a certain class $\mathcal{E}_j$ with non-zero prior probability $p_j = P(\mathcal{E}_j) \neq 0$ for some $j \in \Lambda$,

the class-conditional probability $q_{i|j}(\mathcal{A})$ is the conditional probability that $\mathcal{A}$ assigns a
certain label $\ell_i \in L$ to an elementary event $e \in \mathcal{E}_j$, and is given by:

$$q_{i|j}(\mathcal{A}) = P\left(e \in \mathcal{A}^\natural[\{\ell_i\}] \mid e \in \mathcal{E}_j\right) = \frac{P\left(\mathcal{A}^\natural[\{\ell_i\}] \cap \mathcal{E}_j\right)}{P(\mathcal{E}_j)}, \quad i, j = 1, 2, 3, \ldots, n \tag{4}$$

For a class $\mathcal{E}_j$ with prior probability $P(\mathcal{E}_j) = 0$, the class-conditional probabilities
conditioned on class $\mathcal{E}_j$ are given by $q_{i|j}(\mathcal{A}) = 0$, $\forall\, i = 1, 2, 3, \ldots, n$. A
class-conditional probability may take on any value in $[0, 1]$, so for each $i$ and $j$, the
class-conditional probability $q_{i|j}(\mathcal{A})$ is a well-defined probability measure; therefore, by
Definition 14 above, we have $\sum_{i=1}^{n} q_{i|j}(\mathcal{A}) = 1$, $\forall\, j = 1, 2, 3, \ldots, n$ [29, p. 54].

Assumption 1 (Independence of Class Prevalence and Class-Conditional Probabilities).
Given an $n$-class classification system $\mathcal{A}$, any index pair $(i, j)$, $i, j = 1, \ldots, n$, and any
index $k = 1, \ldots, n$ such that class $\mathcal{E}_k$ satisfies $P(\mathcal{E}_k) \neq 0$, the set $\{q_{i|j}(\mathcal{A}), p_k\}$ of any
class-conditional probability $q_{i|j}(\mathcal{A})$ and any non-zero class prevalence $p_k$ is independent.

A class-conditional probability might be the likelihood that a signal detection
classification system $\mathcal{A}$ will label an instance of electromagnetic radiation as class $\mathcal{E}_1$,
where this label indicates the presence of signals amidst noise, given that it actually
belongs to class $\mathcal{E}_2$ (e.g., the instance observed by the sensor actually contains only noise):

$$q_{1|2}(\mathcal{A}) = P\left(e \in \mathcal{A}^\natural[\{\ell_1\}] \mid e \in \mathcal{E}_2\right) = \frac{P\left(\mathcal{A}^\natural[\{\ell_1\}] \cap \mathcal{E}_2\right)}{P(\mathcal{E}_2)}$$

To visualize this, imagine the event set $E$ as a unit square, with area representing probability.
As class prevalences change, so do the sizes of the events within $E$ which they define, so as
$p_2$ changes, the size of the event $\mathcal{E}_2 \subset E$ changes in exact proportion; Assumption 1 then
implies that the size of the event intersection $\mathcal{A}^\natural(\{\ell_1\}) \cap \mathcal{E}_2$ must also change such that its
area is scaled by the exact same scalar as is the event $\mathcal{E}_2$.

There  are  several  statistical  methods  available  to  test  the  validity  of  independence 
between  two  populations  whose  distributions  are  not  both  known,  such  as  Kendall’s 
Tau [11, pp. 404-405]. The null hypothesis of this particular non-parametric test is that there is no
association or dependence between the populations [33, p. 816].
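
As a rough illustration of the statistic behind such a test, the sketch below computes Kendall's tau in its simplest (tau-a) form from paired observations. It is an assumption-laden sketch: the paired values are hypothetical, ties are simply counted as neither concordant nor discordant, and no p-value is computed, so a statistics library routine (for example SciPy's `scipy.stats.kendalltau`) would be needed for an actual hypothesis test.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither; no p-value is computed here.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical paired observations of a class prevalence and an estimated
# class-conditional probability recorded under that prevalence
prevalence = [0.10, 0.20, 0.30, 0.40, 0.50]
q_estimate = [0.12, 0.11, 0.14, 0.10, 0.13]

print(kendall_tau_a(prevalence, q_estimate))
```

A value near zero is consistent with the null hypothesis of no association, while values near +1 or -1 indicate strong monotone dependence.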


To the user of a classification system $\mathcal{A}$, the conditional probability
$P\left(e \in \mathcal{E}_j \mid e \in \mathcal{A}^\natural[\{\ell_i\}]\right)$ may be of far greater interest than the class-conditional
probability $P\left(e \in \mathcal{A}^\natural[\{\ell_i\}] \mid e \in \mathcal{E}_j\right)$; however, the set $\{q_{i|j}(\mathcal{A})\}_{i,j=1}^{n}$ of
class-conditional probabilities for $\mathcal{A}$ is information by which the system may be judged
prior to use, since even if Assumption 1 holds and class-conditional probabilities do not
change with class prevalences, the class prevalences themselves, such as that in the formula:

$$P\left(e \in \mathcal{E}_j \mid e \in \mathcal{A}^\natural[\{\ell_i\}]\right) = \frac{q_{i|j}(\mathcal{A})\, p_j}{P\left(\mathcal{A}^\natural[\{\ell_i\}]\right)}$$

may change from moment to moment, even while classification occurs.
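
The dependence of this conditional probability on class prevalence can be seen numerically. The sketch below is illustrative only: the two-class matrix of class-conditional probabilities and the prevalence values are hypothetical, and the denominator is expanded with the standard law of total probability over the classes, which is an added step rather than one stated above.

```python
def posterior(Q, prevalences, i, j):
    """P(e in class j | system assigns label i), with 0-based indices.

    Q[i][j] holds q_{i|j}, so each column of Q sums to one;
    prevalences[j] holds the prior p_j.
    """
    # P(label i) by the law of total probability over the classes
    p_label = sum(Q[i][k] * prevalences[k] for k in range(len(prevalences)))
    if p_label == 0:
        return 0.0  # same zero convention as Theorem 1
    return Q[i][j] * prevalences[j] / p_label

# Hypothetical two-class system: columns are actual classes, rows are labels
Q = [[0.9, 0.2],
     [0.1, 0.8]]

# The class-conditional probabilities stay fixed while the prevalence p_1 changes.
for p1 in (0.5, 0.1):
    print(p1, posterior(Q, [p1, 1 - p1], 0, 0))
```

With the same matrix, lowering the prevalence of class 1 from 0.5 to 0.1 drops the probability that an exemplar labeled 1 truly belongs to class 1 from about 0.82 to about 0.33.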


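Definition 17 (Conditional Probability Matrix). Given a set $\{q_{i|j}(\mathcal{A}_\theta)\}_{i,j=1}^{n}$ of
class-conditional probabilities for a classification system $\mathcal{A}_\theta$, the conditional probability
matrix is given by

$$[Q_{\mathcal{A}_\theta}]_{ij} = q_{i|j}(\mathcal{A}_\theta).$$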


A collection of conditional probability matrices for various classification systems may
be represented by $(n^2 - n)$-dimensional vectors in the Cartesian product $[0, 1]^{n^2 - n} \subset \mathbb{R}^{n^2 - n}$,
since each of the $n$ columns sums to one and so only $n^2 - n$ entries are free.
If one considers a family $\mathcal{A}_\Theta = \{\mathcal{A}_\theta : \theta \in \Theta\}$ of classification systems over a threshold set
$\Theta$ of interest with only continuous parameters, a continuous $(n^2 - n)$-dimensional surface
may then be constructed by infinitesimal variations of these parameters; in practice,
however, such continuous surfaces may only be estimated by a finite number of ROC vector
estimates representing classification systems in the family $\mathcal{A}_\Theta$.

The most common method of representing estimates of class-conditional
probabilities is by calculating a transpose stochastic confusion matrix from experimental
results.  There  are,  of  course,  other  methods  of  obtaining  class-conditional  probability 
estimates,  and  the  distribution  of  ROC  vectors  may  even  be  defined  statistically; 
Assumption  1  then  allows  these  distributions  to  be  treated  separately  from  any 
distributions  attributed  to  class  prevalences. 

To  illustrate  the  calculation  of  a  transpose  stochastic  confusion  matrix,  consider  a 
2 x 2 contingency matrix of raw results for a binary classification experiment (or
observational  study)  with  a  finite  number  of  classification  results  and  a  priori  (or  a 
posteriori,  in  the  case  of  an  observational  study)  knowledge  of  class  populations  for  all 
exemplars  classified.  Such  a  matrix  displays  a  simple  count  of  the  numbers  of  each  type  of 
decision,  including  both  correct  and  incorrect  decisions,  with  correct  decision  counts  along 
the diagonal and with columns corresponding to the truth, as shown in Table 1.


Table 1: Two-Class Contingency Matrix.

Contingency Matrix    Actual Class: 1    Actual Class: 2
Labeled Class: 1      TP                 FP
Labeled Class: 2      FN                 TN

Here, class 1 is the so-called positive or target class, and class 2 the negative; hence,
TP or the true positive count is how many exemplars from class 1 were correctly labeled,
and FN or the false negative count is how many were not, etc. [10, pp. 69-71].

Estimates  of  class-conditional  probabilities  may  be  formed  by  dividing  each  element 
of  a  class-specific  column  in  the  contingency  matrix  by  the  total  number  of  classified 
exemplars from that class. With $M_1$ and $M_2$ exemplars from Classes 1 and 2,
respectively,  undergoing  classification,  we  may  estimate  the  class-conditional  probabilities 
from  Table  1,  as  shown  in  Table  2. 

Table 2: Two-Class Confusion Matrix.

Confusion Matrix      Actual Class: 1    Actual Class: 2
Labeled Class: 1      TP/M_1             FP/M_2
Labeled Class: 2      FN/M_1             TN/M_2

The result is a transpose stochastic confusion matrix, such that the sum of each
column  is  one.  It  is  worth  mentioning  that  some  authors  prefer  the  proper  stochastic 
presentation,  but  for  the  purposes  of  this  thesis,  the  transpose  stochastic  is  more 
convenient  [6,  pp.  8-9].  Also,  the  term  “confusion  matrix”  sometimes  means  the 
contingency  matrix  denoted  above,  and  a  normalized  form  of  the  contingency  matrix  as 
illustrated  above  (that  which  we  term  a  confusion  matrix)  may  be  specified  as  a  “confusion 
rate  matrix”  or  “confusion  ratio  matrix”  to  avoid  confusion  with  the  non-normalized 
form [8, p. 3], [9, p. 2]. Due to its transpose stochastic nature, the information contained in
a  2  x  2  confusion  matrix  of  this  type  may  be  presented  as  a  coordinate  pair  comprised  of 
one  entry  from  each  column,  which  may  then  be  plotted  on  a  unit  square  [30,  pp.  26-28]. 
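
The normalization from Table 1 to Table 2 amounts to dividing each column of the contingency matrix by its column total. The following sketch assumes hypothetical counts; the function name `confusion_from_contingency` is illustrative, and taking the two off-diagonal (error) entries as the coordinate pair is one possible choice of one entry per column.

```python
def confusion_from_contingency(counts):
    """Divide each column of a contingency matrix by its column total.

    counts[i][j] is the number of class-(j+1) exemplars given label i+1.
    """
    n = len(counts)
    totals = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    if any(t == 0 for t in totals):
        raise ValueError("every class needs at least one classified exemplar")
    return [[counts[i][j] / totals[j] for j in range(n)] for i in range(n)]

# Hypothetical counts: TP = 40, FP = 5, FN = 10, TN = 45, so M_1 = M_2 = 50
contingency = [[40, 5],
               [10, 45]]

Q_hat = confusion_from_contingency(contingency)

# One entry from each column: here the two error rates (FP/M_2, FN/M_1)
roc_vector = (Q_hat[0][1], Q_hat[1][0])
print(Q_hat, roc_vector)
```

Each column of `Q_hat` sums to one, which is the transpose stochastic property, and the same function applies unchanged to contingency matrices with more than two classes.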

Assumption 2 (Acceptable Class-Conditional Probability Estimates). Without regard to
the method of obtaining an estimate $\hat{Q}_{\mathcal{A}}$ of the conditional probability matrix $Q_{\mathcal{A}}$ for a
given classification system $\mathcal{A}$, assume that adequate estimation procedures have occurred,
such that for all practical purposes, considering $\hat{Q}_{\mathcal{A}}$ approximately equal to the matrix
$\mathrm{E}[Q_{\mathcal{A}}]$ of expected values of the elements of $Q_{\mathcal{A}}$ results in no appreciable error; i.e., we
may substitute $\mathrm{E}[Q_{\mathcal{A}}] \approx \hat{Q}_{\mathcal{A}}$ whenever it is convenient to do so.

Definition 18 (ROC Manifold, ROC Curve). Given an $n$-class classification problem, the
convex hull of a continuous collection of ROC vector estimates plotted in
$(n^2 - n)$-dimensional space is often termed a ROC curve (for a two-class scenario) or a
ROC manifold [30]. The ROC Convex Hull is abbreviated ROCCH.

If constructing the ROCCH were simple, then comparing only classification systems
whose points lie on the hull might save time, since no points within the hull interior could
possibly represent classification systems superior to those on the hull under any
circumstances [24], [30]. Such considerations would reduce the number of classification
systems to compare and contrast; however, since the simplicity of ROCCH calculation, and
thus the amount of time that might be saved, is questionable, the method of comparison
would seem to be far more important than saving time during such a comparison,
depending, of course, on the possible applications of the classification system. Except for
time-saving purposes, such geometrical concepts have limited utility under
decision-theoretical constructs. The ROCCH, especially in its binary form as the ROC
curve, has played a huge role in ROC analysis for many years and is therefore worthy of
mention; it is not, however, a necessary consideration within the framework of
risk calculation, so this thesis will refer to such concepts only as auxiliary.




Definition 19 (Cost Matrix). A cost matrix given by $[C]_{ij} = C_{i|j}$ is an $n \times n$ matrix of
real numbers representing costs or losses $C_{i|j}$ specific to events $\mathcal{A}^\natural(\{\ell_i\}) \cap \mathcal{E}_j$,
i.e., classification system $\mathcal{A}$ assigns label $\ell_i$ to an elementary event $e$ when it is actually
an element of class $\mathcal{E}_j$, whose class-conditional probability holds the exact same
$(i, j)$-position in the conditional probability matrix $[Q_{\mathcal{A}}]_{ij} = q_{i|j}(\mathcal{A})$. These costs may be
positive  or  negative,  but  most  often,  the  sum  of  off-diagonal  entries  in  any  column  is  greater 
than  the  diagonal  entry  itself,  indicating  that  it  is  better  (i.e.,  less  costly)  to  classify 
something  as  what  it  actually  is  rather  than  anything  else  [5,  pp.  24-25].  This  matrix  may 
also  be  called  a  “payoff”  matrix,  so  its  meaning  is  almost  completely  subjective  [6,  p.  16]. 
One  common  form  is  the  so-called  “zero-one”  transpose  stochastic  cost  matrix,  with  all 
zeroes  on  the  diagonal  and  each  column  summing  to  one;  however,  it  is  not  necessary  to 
restrict  the  cost  matrix  to  such  a  form  [5,  p.  26],  [32,  p.  7]. 

Assumption  3  (Fixed  Costs).  All  elements  of  a  given  cost  matrix  C  are  fixed. 

This  is  a  necessary  assumption,  because  costs  often  are  the  result  of  human  reasoning, 
which  is  very  unpredictable;  therefore,  it  is  easier  to  simply  choose  different  possible  cost 
regimes  and  perform  risk  calculations  under  each  scenario. 

Assumption 4 (Independence of Class-Conditional Costs and Probabilities). Given an n-class classification system A and any index pairs $(i, j)$, $i, j = 1, \ldots, n$, and $(h, k)$, $h, k = 1, \ldots, n$, the set $\{q_{i|j}(A), c_{h|k}\}$ consisting of any class-conditional probability $q_{i|j}(A)$ and any cost $c_{h|k}$ is independent.

Since costs are subjectively defined, it may be possible to envision a scenario where the likelihood of making a particular type of classification decision has a direct impact on the cost of such a decision; however, it should therefore also be possible to define scenarios where costs do not change as estimates of ROC information change.

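Definition 20 (Prevalence Matrix). Given a set $\{p_j\}_{j=1}^{n}$ of class prevalences for a classification system A, the prevalence matrix $P$ is an $n \times n$ stochastic matrix with each row the same ordered n-tuple $p^T$ consisting of the class prevalences $\{p_j\}_{j=1}^{n}$:

\[
P = \begin{bmatrix} p_1 & p_2 & \cdots & p_n \\ \vdots & \vdots & & \vdots \\ p_1 & p_2 & \cdots & p_n \end{bmatrix}
\]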


Assumption 5 (Independence of Class Prevalence and Class-Conditional Costs). Given an n-class classification system A, any index pair $(i, j)$, $i, j = 1, \ldots, n$, and any index $k = 1, \ldots, n$, the set $\{c_{i|j}, p_k\}$ consisting of any cost $c_{i|j}$ and any prior class probability $p_k$ is independent.

Since  the  definition  of  cost  is  purely  subjective,  it  is  certainly  possible  that  the 
individual  costs  of  making  classification  decisions  may  be  independent  of  the  class 
prevalences.  For  example,  in  signal  detection,  this  might  be  like  assuming  that  it  is  always 
equally  costly  to  assume  an  instance  of  electromagnetic  radiation  contains  noise  alone, 
given  that  it  actually  contains  a  signal,  since  the  binary  nature  of  the  setup  seems  to  imply 
that  there  is  potentially  valuable  information  contained  in  any  type  of  signal. 

Definition 21 (Matrix Hadamard Product). Given any two matrices $U$ and $V$ of the same size, the binary Hadamard Matrix Operator $\odot$ forms a new matrix $U \odot V$ of the same size. A typical element of the resultant matrix is given by:

\[
[U \odot V]_{ij} = u_{ij} v_{ij} = v_{ij} u_{ij} = [V \odot U]_{ij} \tag{7}
\]

Definition 22 (Frobenius Dot Product). Given any two matrices $U$ and $V$ of size $s \times r$, the Matrix Dot Product Operator $\langle \cdot , \cdot \rangle_F$ performs the following commutative binary operation:

\[
\langle U, V \rangle_F = \sum_{i=1}^{s} \left( \sum_{j=1}^{r} u_{ij} v_{ij} \right) = \sum_{j=1}^{r} \left( \sum_{i=1}^{s} v_{ij} u_{ij} \right) = \langle V, U \rangle_F \tag{8}
\]

which is simply the sum of all elements of the Hadamard Matrix Product $U \odot V$.
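
As a small illustration of Definitions 21 and 22, the following Python sketch computes both products for matrices stored as nested lists. The function names `hadamard` and `frobenius` and the example matrices are hypothetical and chosen only for illustration; exact fractions are used so that the symmetry of both operations can be checked without rounding.

```python
from fractions import Fraction


def hadamard(U, V):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    if len(U) != len(V) or any(len(ru) != len(rv) for ru, rv in zip(U, V)):
        raise ValueError("matrices must have the same size")
    return [[u * v for u, v in zip(ru, rv)] for ru, rv in zip(U, V)]


def frobenius(U, V):
    """Frobenius dot product: the sum of all elements of the Hadamard product."""
    return sum(sum(row) for row in hadamard(U, V))


# Hypothetical 2 x 3 example matrices.
U = [[1, 2, 3], [4, 5, 6]]
V = [[Fraction(1, 2), 0, 1], [2, -1, Fraction(1, 3)]]

print(hadamard(U, V))
print(frobenius(U, V), frobenius(V, U))
```

Both operations return the same result when the arguments are swapped, as stated in (7) and (8).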


Definition 23 (Standard m-Simplex). Given any positive integer $m$, along with any ordered m-tuple $p \in [0, 1]^m$ of non-negative real variables $p_j$, $j = 1, 2, 3, \ldots, m$, the standard m-simplex $\Delta_m$ is the set [3, p. 568], [20, pp. 149-150]:

\[
\Delta_m = \left\{ p \in [0, 1]^m : \sum_{j=1}^{m} p_j \le 1 \right\} \tag{9}
\]

Figure 1 shows an example of how, for a three-class scenario, a two-dimensional prevalence vector $p = (p_1, p_2)$ whose coordinates sum to a number less than 1 may be drawn from $\Delta_2$. Note that the unspecified value of $p_3$ is found from the conjunctive equation $\sum_{j=1}^{3} p_j = 1$, as illustrated in Figure 2, where this point lies on the tilted surface of a standard 3-simplex.
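
The relationship between a point of the standard simplex and the full prevalence vector can be sketched in a few lines of Python. The helper names `in_standard_simplex` and `complete_prevalences` and the sample values are hypothetical; the sketch assumes the simplex is the set of non-negative tuples whose coordinates sum to at most one.

```python
from fractions import Fraction


def in_standard_simplex(p):
    """True when every coordinate lies in [0, 1] and the coordinates sum to at most 1."""
    return all(0 <= x <= 1 for x in p) and sum(p) <= 1


def complete_prevalences(p):
    """Append the last prevalence implied by the conjunctive equation sum(p) = 1."""
    if not in_standard_simplex(p):
        raise ValueError("p does not lie in the standard simplex")
    return list(p) + [1 - sum(p)]


# Three-class scenario: two prevalences are chosen, the third is determined.
p = [Fraction(1, 2), Fraction(3, 10)]
full = complete_prevalences(p)
print(full, sum(full))
```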

Definition 24 (s x r Random Matrix). Given an $s \times r$ matrix $B = [B_{ij}]$ of event sets and an $s \times r$ matrix $X$ of functions $x_{ij}$ defined on $B_{ij}$, respectively, $X$ is an $s \times r$ matrix of random variables, or a random matrix, when the codomain of each function $x_{ij} : B_{ij} \to \mathbb{R}$ is the set $\mathbb{R}$ of real numbers [33, p. 73].

Figure 1: $\Delta_2 = \{ p \in [0, 1]^2 : \sum_{j=1}^{2} p_j \le 1 \}$.

There is no stipulation as to what type of event set a random variable may be defined upon; therefore, any or all of the event sets in a matrix $B$ of event sets may be either discrete or continuous. Given an n-class classification system A and the corresponding real-valued matrices $Q_A$, $C$, and $P$ introduced in Definitions 17, 19, and 20 above, respectively, note that each is a random matrix defined on a matrix of event sets.

With regard to the matrices $Q_A$ and $P$, it is also evident that there are exactly $n$ random variables that are functions of only $(n - 1)$ of the random variables inhabiting the same column or row, due to the respective transpose stochastic and stochastic natures of these matrices. For example, since the prevalence matrix $P$ is simply the same random vector $p$ arrayed next to itself $n$ times, the definitions of all $n$ of these functionally dependent variables are exactly the same; similarly, of the remaining $(n^2 - n)$ random variables that could be non-constant, there are actually only $(n - 1)$ unique random variables. It will become apparent later why this notation is used; it is sufficient for now to notice that any joint distribution defined for $P$ will be a function of the same $(n - 1)$ unique random variables that populate each of its rows, as will any joint distribution defined for a given column of $Q_A$. The respective stochastic and transpose stochastic natures of these matrices mean that an entire row or column vector of random variables will be jointly distributed over a standard $(n - 1)$-simplex (see Definition 23 above), since, for example, the nth entry $p_n$ randomly drawn in each row vector $p$ in the prevalence matrix $P$ is a function $p_n = 1 - \sum_{j=1}^{n-1} p_j$ of the other $(n - 1)$ random variables in the row which are randomly drawn before it.

1.3  Problem  Statement 

Before proceeding further, let us motivate the need for the preceding definitions by means of the following situation. Imagine a stockbroker analyzing the contents of a certain client’s portfolio, implementing an algorithm that classifies stocks as either buy, sell, or hold. Although unknown to the broker or the directors of the corporations whose stocks she analyzes, there is a set of seemingly insignificant factors that, when occurring simultaneously, create severe danger of financial ruin for many of these corporations. Her classification system was created to detect just such problems, however, and it reports that 85% of stocks in this particular portfolio are sell stocks, i.e., stocks that ought to be sold immediately. Since the broker has never seen numbers for the sell class greater than 10%, she begins to question the results, and therefore does not immediately sell those stocks. Time ticks by, and it becomes more readily apparent to the corporations and the broker that the stocks are highly over-valued, and the window of opportunity to sell with minimal loss shrinks away overnight.

If  the  broker  in  this  case  had  been  informed  beforehand  that  the  classification  system 
she  used  had  been  selected  and  tuned  specifically  to  the  cost  structure  dictated  by  her 
management,  and  that  the  distribution  of  stock  class  prevalences  provided  for  the 
possibility  of  unknown  factors  causing  a  change  in  stock  class  prevalences,  she  might  have 
had  more  confidence  in  the  classification  system,  and  then  acted  immediately  to  avoid 
losing  more  money  for  her  client,  because  her  risk  was  already  minimized  by  acting  on  the 
results  of  the  classification  system. 

Although the scenario above might be unrealistic, there are many classification situations which entail potentially much greater costs, e.g., from the loss of life (military applications are just one). However, many popular methods of comparing classification systems to one another do not consider the whole picture of risk, i.e., the costs and class prevalences in addition to an estimate of the class-conditional probabilities. In addition to these oversights, because the volume under a ROC surface (VUS) in a three-class case would have six dimensions, visualization of geometric surfaces becomes impossible, so ignoring more than just one of the entries per column of a conditional probability matrix estimate is also sometimes chosen as an alternative [21]. Most attempts to generalize geometric concepts to the general n-class case choose to ignore either the class prevalences or the costs [7], [9], [35]. If Assumptions 1, 2, 3, 4, and 5 are relatively safe assumptions to make, then the concept of risk offers the opportunity for a much more robust form of ROC analysis; i.e., one which considers many more of the characteristics of the operating environment in which the receiver of information resides.

II. Review of Related ROC Analysis Topics


The monograph by Egan serves as a starting point for modern binary ROC analysis. It contains much of the terminology and geometry still in use today, as well as a framework for risk calculations [6, p. 16]. Based on his work, for a given classification system A and accompanying conditional probability, cost, and prevalence matrices $Q_A$, $C$, and $P$ as given in Definitions 9, 17, 19, and 20 of Chapter I, respectively, we define the risk $R(A)$ of a classification system A (suppressing notational dependence on A) as:

\[
R = \langle Q, (C \odot P) \rangle_F \tag{10}
\]

with Matrix Hadamard Product $\odot$ and Frobenius Dot Product $\langle \cdot , \cdot \rangle_F$ as defined in Chapter I, Definitions 21 and 22, respectively. Egan notes that (10) gives the expected cost over a sufficiently large number of trials; therefore, from this point onward, we shall assume, as in Chapter I, Assumption 2, that such is the case [6, pp. 16-17].
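
To make the risk in (10) concrete, the following Python sketch evaluates it for a hypothetical three-class system. All numbers are invented for illustration, and the function name `risk` is an assumption of this sketch. Because every row of the prevalence matrix is the same prevalence vector, the entry of $C \odot P$ in position $(i, j)$ is the cost of assigning label $i$ to class $j$ times the prevalence of class $j$, so the prevalences are passed as a vector.

```python
from fractions import Fraction as F


def risk(Q, C, p):
    """R = <Q, C ⊙ P>_F, where every row of P equals the prevalence vector p.

    Q[i][j] is the probability of assigning label i given true class j,
    C[i][j] is the cost of that outcome, and p[j] is the prevalence of class j.
    """
    n = len(p)
    return sum(Q[i][j] * C[i][j] * p[j] for i in range(n) for j in range(n))


# Hypothetical three-class example; each column of Q sums to one.
Q = [[F(8, 10), F(1, 10), F(2, 10)],
     [F(1, 10), F(7, 10), F(3, 10)],
     [F(1, 10), F(2, 10), F(5, 10)]]
C = [[0, 4, 1],
     [2, 0, 1],
     [1, 3, 0]]
p = [F(1, 2), F(3, 10), F(1, 5)]

print(risk(Q, C, p))
```

For these invented inputs the risk works out to 11/20; changing any cost or prevalence changes the result, which is the sensitivity that purely geometric measures lack.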

2.1  Two-Class  ROC  Analysis 

Assume the following notation of a transpose stochastic confusion matrix for some binary classification system A (again, suppressing notational dependence on A):

\[
Q = \begin{bmatrix} q_{1|1} & q_{1|2} \\ q_{2|1} & q_{2|2} \end{bmatrix} = \begin{bmatrix} tpr & fpr \\ fnr & tnr \end{bmatrix} \tag{11}
\]


where class 1 assumes the role of the so-called positive or target class, and class 2 is the negative or non-target class, thereby leading to the abbreviations for true positive rate (tpr), false negative rate (fnr), false positive rate (fpr), and true negative rate (tnr) in common use today [10], [32]. Since this matrix is transpose stochastic, its information may be represented by only one entry from each column. Although there are four possible ways to do this, the common way is to plot (fpr, tpr) as in Figure 3, so that the coordinate representation of a perfect classifier is at (0, 1). In this coordinate system, maximal area beneath the lines connecting a plotted point for a given classification system to the corners (0, 0) and (1, 1) is seen as desirable [8, pp. 108-110].

Based  on  this  geometrical  frame  of  reference,  one  of  the  most  popular  means  of 
evaluating  classification  system  effectiveness  is  by  the  Area  Under  the  ROC  Curve  (AUC) 
performance  measure,  which  calculates  geometrically  the  area  under  the  convex  hull  of  a 
collection  of  ROC  vector  estimates  plotted  in  this  way  to  represent  a  family  of  binary 
classification  systems  [10],  [13],  [26].  Instead  of  analyzing  collections  of  ROC  vectors, 
consider  the  case  with  just  one  plotted  ROC  vector  [8,  pp.  108-110],  as  in  Figure  3. 


Figure  3:  ROC  Curve  for  One  ROC  Point  Estimate. 

The area under this ROC curve is simply the sum of a rectangle and two triangles:

\[
\begin{aligned}
AUC &= (1 - fpr)(tpr) + \frac{(fpr)(tpr)}{2} + \frac{(1 - fpr)(1 - tpr)}{2} \\
&= \frac{2(1 - fpr)(tpr) + (fpr)(tpr) + (1 - fpr)(1 - tpr)}{2} \\
&= \frac{2[tpr - (fpr)(tpr)] + (fpr)(tpr) + 1 - tpr - fpr + (tpr)(fpr)}{2} \\
&= \frac{2(tpr) - 2(fpr)(tpr) + (fpr)(tpr) + 1 - tpr - fpr + (tpr)(fpr)}{2} \\
&= \frac{tpr + (1 - fpr)}{2} \\
&= \frac{tpr + tnr}{2}
\end{aligned} \tag{12}
\]


where the last observation is made possible by the conjunctive equation $fpr + tnr = 1$ pertaining to the right column in (11) above. Now, if we assume equal class prevalences $M_1 = M = M_2$, we may write:

\[
\begin{aligned}
AUC = \frac{tpr + tnr}{2} &= \frac{\frac{TP}{M_1} + \frac{TN}{M_2}}{2} = \frac{\frac{TP}{M} + \frac{TN}{M}}{2} \\
&= \frac{TP + TN}{2M} = \frac{TP + TN}{M_1 + M_2} = Accuracy
\end{aligned} \tag{13}
\]
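
The identities in (12) and (13) can be checked numerically. The Python sketch below computes the trapezoidal area under the polyline through (0, 0), (fpr, tpr), and (1, 1) and compares it with the mean of tpr and tnr and with Accuracy. The counts are hypothetical, the function name `auc_single_point` is an assumption of this sketch, and equal class sizes are assumed as in (13).

```python
from fractions import Fraction as F


def auc_single_point(fpr, tpr):
    """Trapezoidal area under the polyline (0,0) -> (fpr,tpr) -> (1,1)."""
    pts = [(F(0), F(0)), (fpr, tpr), (F(1), F(1))]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical counts with equal class sizes M1 = M2 = M.
M = 50
TP, FN = 40, 10   # instances of class 1 (positive)
TN, FP = 35, 15   # instances of class 2 (negative)

tpr, tnr, fpr = F(TP, M), F(TN, M), F(FP, M)
accuracy = F(TP + TN, 2 * M)

print(auc_single_point(fpr, tpr), (tpr + tnr) / 2, accuracy)
```

All three printed quantities agree for these counts; the agreement with Accuracy relies on the equal class sizes.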


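Accuracy is related to risk through the AUC, as seen when we calculate the approximate risk $R \approx \langle \hat{Q}, (C \odot P) \rangle_F$ (per Assumption 2, Chapter I) indicated by (10) for a zero-one cost matrix $C = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ under equal class prevalences:

\[
\begin{aligned}
R &\approx \langle \hat{Q}, (C \odot P) \rangle_F \\
&= \left\langle \begin{bmatrix} tpr & fpr \\ fnr & tnr \end{bmatrix}, \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \odot \begin{bmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix} \right\rangle_F \\
&= \left\langle \begin{bmatrix} tpr & fpr \\ fnr & tnr \end{bmatrix}, \begin{bmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{bmatrix} \right\rangle_F \\
&= \frac{(0 + fpr) + (fnr + 0)}{2} \\
&= \frac{(1 - tnr) + (1 - tpr)}{2} \\
&= 1 - \frac{tpr + tnr}{2} \\
&= 1 - AUC
\end{aligned} \tag{14}
\]

where the last relation again is made possible by the conjunctive equations from (11) above; therefore, by (12) and (13) above, the risk R for a zero-one cost matrix and equal class prevalences is simply (1 − Accuracy) under the same assumptions, and (1 − AUC) in general for a ROC curve with only one point. It is interesting to note that if the coordinate pair used to represent the information of the transpose stochastic matrices in (11) were (fpr, fnr) instead of (fpr, tpr), the calculation in (14) would yield $R \approx AUC$.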
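
The calculation in (14) can be reproduced with a short Python sketch. The operating point and the skewed prevalence vector are hypothetical, and `risk` is an illustrative helper that evaluates (10) with the prevalence matrix represented by its common row. The second evaluation is included only to show that the identity with (1 − AUC) depends on the equal-prevalence assumption.

```python
from fractions import Fraction as F


def risk(Q, C, p):
    """R = <Q, C ⊙ P>_F, with every row of P equal to the prevalence vector p."""
    n = len(p)
    return sum(Q[i][j] * C[i][j] * p[j] for i in range(n) for j in range(n))


tpr, fpr = F(4, 5), F(3, 10)   # hypothetical operating point
fnr, tnr = 1 - tpr, 1 - fpr    # conjunctive equations from (11)

Q = [[tpr, fpr],
     [fnr, tnr]]
C = [[0, 1],
     [1, 0]]                   # zero-one cost matrix
p_equal = [F(1, 2), F(1, 2)]   # equal class prevalences

auc = (tpr + tnr) / 2          # single-point AUC from (12)
R_equal = risk(Q, C, p_equal)
print(R_equal, 1 - auc)

# The same classifier under unequal (hypothetical) prevalences.
p_skewed = [F(9, 10), F(1, 10)]
R_skewed = risk(Q, C, p_skewed)
print(R_skewed)
```

The AUC is unchanged by the second prevalence vector, while the risk moves from 1/4 to 21/100.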

Neither the AUC nor Accuracy considers costs, but unlike Accuracy, the AUC also does not consider class prevalences in its calculation, and so the AUCs for two very different classification systems may be equal, as shown in Figure 4.

Figure 4: Two ROC Curves with Equal AUC.

There is some merit to the idea that the conventional formula for Accuracy considers class prevalences, but it still ignores the costs, and for that reason is incomplete as a measure of risk [10], [14], [19], [25]. It is also not robust to changes in class prevalences when extended to a classification system with more than 2 classes. There are quite a few other performance measures related to Accuracy or the AUC which we shall not mention, because they share similar weaknesses with respect to changes in class prevalences and differing costs. Proceeding in this manner, it becomes apparent that any other ROC analysis calculation based on Accuracy or the AUC is equivalent to a risk calculation with certain restrictions on the values of the cost and prior information.

In  general,  it  appears  that  none  of  the  binary  ROC  analysis  methods  in  popular  use 
today  truly  utilize  the  significant  Bayesian  inputs  of  costs  and  prior  probabilities. 

2.2 Multi-Class ROC Analysis

Extending classical ROC analysis beyond the realm of binary classification is difficult. Some authors have proposed using only one entry per column of a 3 x 3 transpose stochastic confusion matrix, eliminating most of the ROC information, and explicitly considering neither costs nor class prevalences in their calculations [21, pp. 80-82], [22, p. 3441], [34, p. 4]. More recent approaches consider either the costs or the priors as one of the parameters in the threshold set, ignoring the effect of the other, or suggest plotting different curves for each pair of classes as done in binary ROC analysis [9], [12], [36].

Due to the conjunctive equations accompanying any conditional probability matrix, the Volume Under the Surface (VUS) for an n-class scenario is only a true extension of the AUC when it is $(n^2 - n)$-dimensional; however, some authors, in order to produce visible surfaces, plot only a 3-dimensional surface for a 3-class system [4], [21], [22]. Some, who realize the weakness such a scheme entails, allude to the calculation of risk; however, the calculation is not performed, because, for example, the need to assume unknown costs is deemed important. In general, breaching Assumptions 1, 2, 3, 4, and 5 from Chapter I is never mentioned as a cause for not calculating the risk [8], [9].

2.3  Need  for  a  ROC  Risk  Functional 

The world of classical ROC analysis seems to be stuck unnecessarily in a frame of reference that considers geometrical analyses as the gold standard of ROC analysis methods, when in fact, if Assumptions 1, 2, 3, 4, and 5 from Chapter I can be met, the comparison of classification systems by risk comparison is both simpler and more comprehensive. In addition, risk-based comparison of classification systems falls closer to the reach of the ordinary decision-maker, who is usually not involved in obtaining estimates of class-conditional probabilities, but usually is responsible for defining costs and may even have some knowledge of class prevalence distributions. When risk-based comparisons of classification systems are implemented, the result is much simpler, as well as more considerate of the crucial role of both costs and class prevalences in ROC analysis [32].


III.  Development  and  Definition  of  the  ROC  Risk  Functional 

The ROC Functional $f_{\mathbb{A}}$ suggested in [31] and [32] for a family $\mathbb{A}$ of classification systems is a ROC analysis method which minimizes risk. Additionally, it was proposed in [32] to allow the Hadamard Product $\gamma = (c \odot p)$ of vectors of costs and class prevalences, constructed in a particular way, to vary over a range $\Gamma = \{\gamma : \gamma = c \odot p\}$, along with restrictions on cost. This Robust Functional implicitly incorporated Assumptions 1, 2, and 4 from Chapter I, but did not incorporate Assumptions 3 and 5 from that Chapter, which made the problem more difficult. The equation was written as:

\[
R(f_{\mathbb{A}}, \Gamma) = \min_{q \in \mathcal{Q}} \int_{\Gamma} \langle q, \gamma \rangle \, W(\gamma) \, d\gamma \tag{15}
\]

where $\mathcal{Q}$ is a collection $\mathcal{Q} = \{q : q \in \mathcal{Q}\}$ of ROC vectors corresponding to the family $\mathbb{A}$ of classification systems and $W(\gamma)$ is a joint weighting function of the cost-prior Hadamard Product vector $\gamma = c \odot p$, cast either as a probability density function or a belief function [32, p. 6].

In addition to the implicit incorporation of Assumptions 1, 2, and 4 from Chapter I, if we also incorporate Assumptions 3 and 5 from the same Chapter, we may fashion the distributions of costs and priors independently from one another by recasting $q$, $c$, and $p$ in (15) above as the random matrices $Q_A$, $C$, and $P$ (see Definitions 24, 17, 19, and 20 from Chapter I). Without explicitly denoting dependence on the classification system A, we define marginal weighting functions $W_Q(Q)$, $W_C(C)$, and $W_P(P)$. Since these marginal distributions are defined for sets of random variables assumed independent from one another, they satisfy the separability condition [33, p. 245]:

\[
W_{Q,C,P}(Q, C, P) = W_Q(Q) \, W_C(C) \, W_P(P)
\]


We shall now examine possible joint probability density functions $W_P(P)$ on the priors such that the constraints of the conjunctive equation $\sum_{j=1}^{n} p_j = 1$ are met.

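Note that (15) simply calculates Bayes risk, or the expected value of Equation (10), Chapter II. Without explicitly denoting dependence on the classification system A or functional dependence on the variables in the matrices $Q_A$, $C$, and $P$, we may write:

\[
\begin{aligned}
E(R) &= E\left[ \langle Q, (C \odot P) \rangle_F \right] \\
&= \int\!\!\int\!\!\int \langle Q, (C \odot P) \rangle_F \, W_{Q,C,P} \, dQ \, dC \, dP \\
&= \int\!\!\int\!\!\int \sum_{i=1}^{n} \left( \sum_{j=1}^{n} q_{i|j} \, c_{i|j} \, p_j \right) W_{Q,C,P} \, dQ \, dC \, dP \\
&= \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \int\!\!\int\!\!\int q_{i|j} \, c_{i|j} \, p_j \, (W_Q W_C W_P) \, dQ \, dC \, dP \right) \\
&= \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \int p_j W_P \int c_{i|j} W_C \int q_{i|j} W_Q \, dQ \, dC \, dP \right) \\
&= \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \left( \int p_j W_P \, dP \right) \left( \int c_{i|j} W_C \, dC \right) \left( \int q_{i|j} W_Q \, dQ \right) \right) \\
&= \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \left( E[p_j] \right) \left( E[c_{i|j}] \right) \left( E[q_{i|j}] \right) \right) \\
&= \langle \mathbf{E}[Q], \mathbf{E}[C] \odot \mathbf{E}[P] \rangle_F \\
&\approx \langle \hat{Q}, C \odot \mathbf{E}[P] \rangle_F
\end{aligned} \tag{16}
\]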

where the boldface expected value $\mathbf{E}[\cdot]$ denotes a matrix $[E(\cdot)]$ of expected values, and where we have introduced the notation of integration with respect to a matrix, such that $\int [\,\cdot\,] \, dX$ denotes the multiple integration operator:

\[
\int \cdots \int [\,\cdot\,] \; dx_{11} \, dx_{12} \cdots dx_{1r} \cdots dx_{sr}
\]

with respect to all of the variables in the matrix $X$ of size $s \times r$, such that $dX$ denotes the product of all differential elements $dx_{ij}$ of variables in $X$, $\forall\, i \in \{1, 2, 3, \ldots, s\}$ and $j \in \{1, 2, 3, \ldots, r\}$. Note that without Assumptions 1, 2, 3, 4, and 5 from Chapter I, we could not perform this simple dot product calculation for Bayes risk [33, pp. 233-246].
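
The factorization of the expected risk into a dot product of expected values can be illustrated with small discrete distributions, for which the integrals reduce to finite sums. In the Python sketch below, the two-point distributions for the conditional probability matrix and the prevalence vector, the fixed cost matrix, and all helper names are hypothetical; the two random quantities are combined as independent, in the spirit of the independence assumptions of Chapter I.

```python
from fractions import Fraction as F
from itertools import product


def risk(Q, C, p):
    """R = <Q, C ⊙ P>_F, with every row of P equal to the prevalence vector p."""
    n = len(p)
    return sum(Q[i][j] * C[i][j] * p[j] for i in range(n) for j in range(n))


def mean_matrix(dist):
    """Entry-wise expected value of a discrete distribution [(weight, matrix), ...]."""
    rows, cols = len(dist[0][1]), len(dist[0][1][0])
    return [[sum(w * M[i][j] for w, M in dist) for j in range(cols)]
            for i in range(rows)]


def mean_vector(dist):
    """Entry-wise expected value of a discrete distribution [(weight, vector), ...]."""
    return [sum(w * v[j] for w, v in dist) for j in range(len(dist[0][1]))]


C = [[0, 5],
     [1, 0]]  # fixed costs

# Hypothetical two-point distributions (weights sum to one).
Q_dist = [(F(1, 4), [[F(9, 10), F(2, 10)], [F(1, 10), F(8, 10)]]),
          (F(3, 4), [[F(7, 10), F(1, 10)], [F(3, 10), F(9, 10)]])]
p_dist = [(F(1, 3), [F(1, 2), F(1, 2)]),
          (F(2, 3), [F(1, 5), F(4, 5)])]

# Expected risk by enumerating the independent joint distribution ...
expected_risk = sum(wq * wp * risk(Q, C, p)
                    for (wq, Q), (wp, p) in product(Q_dist, p_dist))
# ... and the risk evaluated at the matrices of expected values.
risk_of_means = risk(mean_matrix(Q_dist), C, mean_vector(p_dist))

print(expected_risk, risk_of_means)
```

The two quantities coincide because the joint weights are products of the marginal weights; a joint distribution that coupled the conditional probabilities to the prevalences would not, in general, give the same agreement.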


3.1  Definition  of  the  ROC  Risk  Functional 

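Given a family $\mathbb{A}_\Theta = \{A_\theta : \theta \in \Theta\}$ of n-class classification systems of form $A_\theta : E \to L$ over a threshold set $\Theta$, with common cost and prevalence matrices $C$ and $P$ and a collection $\{Q_{A_\theta} : \theta \in \Theta\}$ of conditional probability matrices, as defined in Chapter I, Definitions 11, 10, 19, 20, and 17, respectively, define the ROC Risk Functional (RRF) as a threshold parameter $\theta \in \Theta$ such that the classification system $A_\theta$ minimizes Bayes risk over the family $\mathbb{A}_\Theta$ of classification systems:

\[
\begin{aligned}
\operatorname*{arg\,min}_{A_\theta \in \mathbb{A}_\Theta} \left\{ E\left[ R_{A_\theta} \right] \right\}
&= \operatorname*{arg\,min}_{A_\theta \in \mathbb{A}_\Theta} \left\{ E\left[ \langle Q_{A_\theta}, (C \odot P) \rangle_F \right] \right\} \\
&\approx \operatorname*{arg\,min}_{A_\theta \in \mathbb{A}_\Theta} \left\{ \langle \hat{Q}_{A_\theta}, C \odot \mathbf{E}[P] \rangle_F \right\}
\end{aligned} \tag{17}
\]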


As a result of Assumptions 1, 2, 3, 4, and 5 from Chapter I, expected values for elements of the cost, prevalence, and conditional probability matrices may all be analyzed and estimated independently of elements of any other matrix appearing in (17), prior to using them in calculation of Bayes risk when employing the RRF.
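
A minimal sketch of how the RRF selection in (17) might be carried out for a finite family of thresholds is given below in Python. The three thresholds, their estimated conditional probability matrices, the cost matrix, and the expected prevalence vectors are all hypothetical, and the function names are assumptions of this sketch rather than part of the thesis.

```python
from fractions import Fraction as F


def risk(Q, C, p):
    """<Q, C ⊙ P>_F, with every row of P equal to the prevalence vector p."""
    n = len(p)
    return sum(Q[i][j] * C[i][j] * p[j] for i in range(n) for j in range(n))


def roc_risk_functional(family, C, expected_p):
    """Return the threshold whose estimated matrix minimizes <Q_hat, C ⊙ E[P]>_F."""
    return min(family, key=lambda theta: risk(family[theta], C, expected_p))


def binary_Q(tpr, fpr):
    """Transpose stochastic 2 x 2 matrix built from (tpr, fpr) as in (11)."""
    return [[tpr, fpr], [1 - tpr, 1 - fpr]]


# Hypothetical family indexed by threshold.
family = {
    0.25: binary_Q(F(95, 100), F(40, 100)),
    0.50: binary_Q(F(80, 100), F(15, 100)),
    0.75: binary_Q(F(55, 100), F(5, 100)),
}
C = [[0, 1],
     [5, 0]]  # a miss (label 2 given class 1) costs five times a false alarm

best_equal = roc_risk_functional(family, C, [F(1, 2), F(1, 2)])
best_rare = roc_risk_functional(family, C, [F(1, 20), F(19, 20)])
print(best_equal, best_rare)
```

With equal expected prevalences the lowest threshold is selected, whereas a rare positive class shifts the selection to the highest threshold, so the chosen system depends on both the costs and the expected prevalences.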

We now consider the effect on $\mathbf{E}[P]$ of varying our assumptions on $P$. These assumptions may take many different forms. For example, we may simply consider that we already have an acceptable estimate of $P$, and treat it as a constant, bringing us back to a form like that of Equation (10), Chapter II. We may also populate its rows with the transpose mean vector of a joint statistical distribution, such as a joint uniform distribution representing no knowledge of prior probabilities, or some other jointly continuous fixed-support probability distribution function, such as a multivariate Beta distribution. Finally, we may simply impose a joint weighting based on expert knowledge and belief (a.k.a. a belief function, which is actually a more general type of weighting than a probability distribution function, with potentially greater utility for actual end-users of classification systems [28, pp. 38-39]). In the latter case, we do not end up with a classical risk, but rather a fuzzy risk. Since the case where all random variables in the prevalence matrix $P$ are constants is a matter of simple algebra, we shall examine a small sampling of more interesting possibilities.

3.2  Completely  Unknown  Class  Prevalences 

As noted in Chapter I, for an n-class classification system, exactly $(n - 1)$ of the class prevalences are distributed over a standard $(n - 1)$-simplex, and the remaining class prevalence is found by solving the conjunctive equation inherent in each row of the stochastic matrix $P$ of class prevalences. Observing Theorem 2, Appendix A:

\[
m! = \frac{1}{\displaystyle \int_0^1 \int_0^{1 - p_1} \int_0^{1 - p_1 - p_2} \cdots \int_0^{1 - \sum_{j=1}^{m-1} p_j} dp_m \cdots dp_3 \, dp_2 \, dp_1} \tag{18}
\]

we see the integral in the denominator of (18) is simply over the standard m-simplex $\Delta_m \subset [0, 1]^m$ (see Definition 23, Chapter I). Apply this observation and (18) to the standard simplex $\Delta_{n-1}$ from which the first $(n - 1)$ class prevalences are drawn, yielding:

\[
\frac{1}{(n - 1)!} = \int_{\Delta_{n-1}} dp_{n-1} \cdots dp_3 \, dp_2 \, dp_1 \tag{19}
\]


Assuming nothing whatsoever is known about the prior probabilities, a jointly continuous uniform probability density function $W_{P,\mathrm{uniform}}(p_1, \ldots, p_{n-1})$ of $(n - 1)$ class prevalences over the standard $(n - 1)$-simplex, satisfying $\int_{\Delta_{n-1}} W_{P,\mathrm{uniform}}(p_1, \ldots, p_{n-1}) \, dp_{n-1} \cdots dp_1 = 1$, is then given by [3, p. 568]:

\[
W_{P,\mathrm{uniform}}(p_1, \ldots, p_{n-1}) = \begin{cases} (n - 1)!, & (p_1, \ldots, p_{n-1}) \in \Delta_{n-1} \\ 0, & \text{otherwise} \end{cases} \tag{20}
\]

If we consider the quantity $\mathbf{E}[P]$ appearing in (17) above to be the matrix of expected values whose typical element is:

\[
\int_{\Delta_{n-1}} p_{ij} \, W_{P,\mathrm{uniform}}(p_{i1}, \ldots, p_{i,n-1}) \, dp_{i,n-1} \cdots dp_{i1}
\]

we may simplify our evaluation of these integrals by recalling that each row is identical and that all entries in a given row $i$, save $p_{in} = 1 - \sum_{j=1}^{n-1} p_{ij}$, are the class prevalences $\{p_{ij}\}_{j=1}^{n-1}$ for all rows $i = 1, 2, 3, \ldots, n$. Also, since the rows are identical, there is no need to keep the subscript $i$ when referring to a prior probability $p_j$ for class $\mathcal{E}_j$.

It  is  crucial  to  state  here  that  even  though  we  use  the  words  first  and  last  to  describe 
the  class  prevalences,  the  order  in  which  the  so-called  first  (n  —  1)  class  prevalences  are 
randomly  drawn  from  their  joint  distribution  has  nothing  to  do  with  the  ordering  of  the 
index  set  A  by  which  we  link  them  to  elements  of  the  label  set. 

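Since $p_n = 1 - \sum_{j=1}^{n-1} p_j$ is a function of the $(n - 1)$ class prevalences whose joint distribution is $W_{P,\mathrm{uniform}}(p_1, \ldots, p_{n-1})$, we may use the weighting in (20) to calculate an expected value $E(p_n)$, using some of the equation patterns seen in the proof of Theorem 2, Appendix A:

\[
\begin{aligned}
E(p_n) &= E\left( 1 - \sum_{j=1}^{n-1} p_j \right) \\
&= \int_{\Delta_{n-1}} \left( 1 - \sum_{j=1}^{n-1} p_j \right) W_P(p_1, \ldots, p_{n-1}) \, dp_{n-1} \cdots dp_1 \\
&= \int_{\Delta_{n-1}} \left( 1 - \sum_{j=1}^{n-1} p_j \right) (n - 1)! \, dp_{n-1} \cdots dp_1 \\
&= (n - 1)! \int_0^1 \int_0^{1 - p_1} \cdots \int_0^{1 - \sum_{j=1}^{n-2} p_j} \left( 1 - \sum_{j=1}^{n-1} p_j \right) dp_{n-1} \cdots dp_1 \\
&= (n - 1)! \left( \frac{1}{n!} \right), \quad \text{by (28) and (29), Theorem 2 proof, Appendix A} \\
&= \frac{1}{n}
\end{aligned} \tag{21}
\]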

We may calculate expected values for the other $(n - 1)$ class prevalences in a row by means of Corollary 1, Appendix A, which is known to be true for positive integers less than 48 (i.e., for most practical classification purposes):

\[
(m + 1)! = \frac{1}{\displaystyle \int_0^1 \int_0^{1 - p_1} \int_0^{1 - p_1 - p_2} \cdots \int_0^{1 - \sum_{j=1}^{m-1} p_j} p_j \, dp_m \cdots dp_3 \, dp_2 \, dp_1} = \frac{1}{\displaystyle \int_{\Delta_m} p_j \, dP}, \quad \forall\, j = 1, 2, 3, \ldots, m \tag{22}
\]

Apply (22) to each of the expected values $E(p_j)$, $j = 1, \ldots, n - 1$:

\[
\begin{aligned}
E(p_j) &= \int_{\Delta_{n-1}} p_j \, W_P(p_1, \ldots, p_{n-1}) \, dp_{n-1} \cdots dp_1 \\
&= \int_{\Delta_{n-1}} p_j \, (n - 1)! \, dp_{n-1} \cdots dp_1 \\
&= (n - 1)! \int_{\Delta_{n-1}} p_j \, dp_{n-1} \cdots dp_1 \\
&= (n - 1)! \left( \frac{1}{n!} \right) \\
&= \frac{1}{n}, \quad \forall\, j = 1, \ldots, n - 1
\end{aligned} \tag{23}
\]


so by (21), each entry in the $n \times n$ matrix $\mathbf{E}[P]$ is exactly $\frac{1}{n}$; therefore, if we set all class prevalences equal to begin with, the resultant expected value matrix is the same as when we assume an underlying multivariate uniform distribution over $\Delta_{n-1}$.
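
The result that every expected prevalence equals $\frac{1}{n}$ under the uniform weighting can be checked by simulation. The Python sketch below is a Monte Carlo check rather than the analytical argument of the thesis. It assumes the standard sampling fact that normalizing independent exponential variates yields a point distributed uniformly over the simplex; the function names, the number of draws, and the seed are arbitrary choices of this sketch.

```python
import random


def sample_prevalences(n, rng):
    """Draw (p_1, ..., p_n) with (p_1, ..., p_{n-1}) uniform on the standard (n-1)-simplex.

    Normalizing n independent exponential variates gives a flat Dirichlet draw;
    the last prevalence is recovered from the conjunctive equation.
    """
    e = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(e)
    first = [x / total for x in e[:-1]]
    return first + [1.0 - sum(first)]


def estimate_expected_prevalences(n, draws=20000, seed=0):
    """Average many simulated prevalence vectors to estimate E[p_j] for each class."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(draws):
        for j, pj in enumerate(sample_prevalences(n, rng)):
            sums[j] += pj
    return [s / draws for s in sums]


for n in (2, 3, 5):
    print(n, [round(m, 3) for m in estimate_expected_prevalences(n)])
```

Each estimated mean should lie close to $\frac{1}{n}$, up to sampling error.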


3.3  Limited  Knowledge  of  Class  Prevalences 

Probability  density  functions,  such  as  the  multivariate  uniform  and  Beta 
distributions,  are  only  a  specific  kind  of  “weighting”  function  Wp  to  be  used  in  evidential 
or  probabilistic  reasoning  regarding  the  class  prevalences,  since  the  classical  structure  of 
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probability  is  only  a  specific  instance  of  an  infinite  number  of  ways  to  approach  the  so-called 


“doctrine  of  chances”  [2],  [28].  This  is  the  reason  we  have  chosen  to  denote  the  weighting 
function  as  Wp  instead  of  the  usual  symbol  fp  for  a  marginal  probability  density  function 
of  the  class  prevalences.  Even  though  we  leave  the  framework  open  for  expansion,  we  will 
only  consider  one  additional  classical  distribution — a  multivariate  general  Beta. 

The marginal versions of a general Beta distribution are very flexible and may even
be made to approximate normal distributions over limited support intervals. The standard
univariate Beta distribution has support on [0, 1], and thus has a set of two parameters (for
shape), but the general form has four parameters, because it includes two parameters $S_{lower}$
and $S_{upper}$ giving the bounds of the support interval $[S_{lower}, S_{upper}] \subset \mathbb{R}$ over which it is
defined [15, p. 210]. In this thesis we shall always define the lower support bound to be
$S_{lower} = 0$, effectively reducing the number of parameters to three; further, we shall also
consider only those marginal Beta distributions that have the potential to approximate a
normal distribution with a mean of $\left( \frac{S_{upper}}{2} \right)$ over their support intervals; i.e., those
symmetric about the midpoint of their support. This means the two shape parameters are
equal, so we have reduced the total number of possibly unique parameters to two: one for
shape, and one for support. We shall denote this special case of the general Beta
distribution as $\beta(t, S)$, where $t$ is the value of the common shape parameter, and $S$ is the
upper bound of the support interval $[0, S]$. In the case where a joint probability density
function $W_{p,\beta}$ is defined over a standard simplex $\Delta_m$, we shall indicate such joint support
by the notation $\beta(\mathbf{t}, \Delta_m)$, where $\mathbf{t} \in (0, \infty)^m$ is a vector of the common shape parameters
used in the symmetric marginal probability density functions.
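To make the two-parameter special case concrete, the sketch below evaluates the density of a symmetric Beta rescaled to $[0, S]$ and checks numerically that it integrates to 1 with mean $S/2$. It is an illustration under stated assumptions: the function names and the values t = 5, S = 0.6 are hypothetical, and the density is the standard Beta(t, t) density with a change of variable x/S.

```python
import math

def symmetric_beta_pdf(x, t, S):
    """Density of the symmetric general Beta with common shape t on [0, S]."""
    if x <= 0.0 or x >= S:
        return 0.0
    # Log of the normalising constant 1 / (B(t, t) * S); lgamma avoids
    # overflow for large shape values such as t = 290.
    log_norm = math.lgamma(2 * t) - 2 * math.lgamma(t) - math.log(S)
    z = x / S
    return math.exp(log_norm + (t - 1) * (math.log(z) + math.log(1 - z)))

def midpoint_integral(f, lo, hi, steps=20000):
    """Plain midpoint rule, adequate for a smooth density."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

t, S = 5.0, 0.6  # hypothetical shape and support bound
total = midpoint_integral(lambda x: symmetric_beta_pdf(x, t, S), 0.0, S)
mean = midpoint_integral(lambda x: x * symmetric_beta_pdf(x, t, S), 0.0, S)
print(total, mean)  # total should be near 1, mean near S / 2
```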

The support parameter of the marginal distributions of the first $(n - 1)$ class
prevalences randomly drawn over an $(n - 1)$-simplex is a function of all prevalences
previously drawn. In fact, it is because of this that expected value calculations for any
function $f(p_1, \ldots, p_{n-1})$ may then be performed by means of the operator:


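\[
\begin{aligned}
\int_{\Delta_{n-1}} (\,\cdot\,) \, W_p \, dp_{n-1} \cdots dp_1
&= \int_{\Delta_{n-1}} (\,\cdot\,) \, W_p(p_1, \ldots, p_{n-1}) \, dp_{n-1} \cdots dp_1 \\
&= \int_0^1 W_{p,1} \cdots \int_0^{1 - \sum_{i=1}^{n-2} p_i} (\,\cdot\,) \, W_{p,n-1} \, dp_{n-1} \cdots dp_1
\end{aligned} \tag{24}
\]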


by decomposing the joint distribution $W_p(p_1, \ldots, p_{n-1})$ into a form allowing elimination
of one variable at a time, working from the inside of the integral toward the outside:

\[
W_p(p_1, \ldots, p_{n-1}) = W_{p,1}(p_1) \, W_{p,2}(p_1, p_2) \cdots W_{p,n-1}(p_1, \ldots, p_{n-1}) \tag{25}
\]
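The factorization in (25) suggests drawing the prevalences one at a time. The sketch below is an illustration only and rests on an assumption not spelled out above: that the k-th prevalence is a symmetric Beta variate rescaled to the interval from 0 to one minus the sum of the prevalences already drawn. The function name and the shape values passed to it are illustrative.

```python
import random

def draw_prevalences(shapes, rng):
    """Draw n class prevalences one at a time over the standard simplex.

    shapes holds the common shape parameter for each of the first n-1
    prevalences. Assumption: prevalence k is a symmetric Beta(t_k, t_k)
    variate rescaled to [0, 1 - (sum of earlier prevalences)].
    """
    remaining = 1.0
    prevalences = []
    for t in shapes:
        p = remaining * rng.betavariate(t, t)  # support shrinks as we go
        prevalences.append(p)
        remaining -= p
    prevalences.append(remaining)  # the last class takes what is left
    return prevalences

rng = random.Random(0)
draws = [draw_prevalences([5.0, 290.0], rng) for _ in range(5000)]
mean_p1 = sum(d[0] for d in draws) / len(draws)
print(draws[0], mean_p1)  # the first prevalence should average near 1/2
```

Under this reading, the first prevalence drawn has support [0, 1] and therefore mean 1/2, while later prevalences inherit a support bound that depends on the earlier draws.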


Considering  a  3-class  scenario,  we  attempt  to  approximate  a  bivariate  normal 
distribution  of  the  first  two  class  prevalences  randomly  drawn  from  the  standard  2-simplex. 
Figure 5 depicts a bivariate $\beta([5, 290], \Delta_2)$ probability distribution function
$W_{p,\beta([5,290]),\Delta_2}(p_1, p_2)$ of two class prevalences over the standard 2-simplex. The values of
the common shape parameters for the marginal probability density functions were chosen
after examining Figures 6 and 7.
IV. Application of Results to Actual Data


The Fisher Iris Data is a well-known data set consisting of four measurements (in
millimeters) of various physical attributes for three species of iris flowers, namely, Iris
Setosa, Iris Versicolor, and Iris Virginica. There are 50 such sets of measurements per
species, allowing for great flexibility when varying class distributions such that the data set
is always of significant size. Principal components analysis (PCA) of the data reveals that
the first two principal components account for about 98% of the variance in the data;
therefore, we sped up computation by using only the component scores from these two
components. The PCA scores for these two components are shown in Figure 8.


Figure 8: First Two Principal Component Scores, Fisher Iris Data. (Scores plotted pairwise; horizontal axis: First Principal Component Score, Eigenvalue 422.8.)

We used Probabilistic Neural Net (PNN) classifiers trained with data distributed
amongst the classes according to a set of three positive numbers summing to 1, the first
two of which were drawn from a specific bivariate distribution assumed to exist over the
standard 2-simplex. Rounding determined the actual prevalence, which may not be exactly
equal to the goal prevalence drawn, due to the fact that one cannot classify a
partial exemplar. We cast Iris Setosa as Class 1, Iris Versicolor as Class 2, and Iris Virginica as
Class 3, and applied the uniform and symmetric Beta distributions examined in Chapter III,
Sections 3.2 and 3.3 to the PCA score data to test the validity of Assumption 1 from
Chapter I, while comparing the performance of the ROC Risk Functional (RRF) to
Accuracy.

To test the validity of Assumption 1 from Chapter I, we used the non-parametric
Kendall’s Tau Correlation Coefficient statistical test, with a null hypothesis of no
dependence between a class-conditional probability estimate $q_{i|j}$ and the prevalence $p_j$ of
the class upon which it is conditioned [11, pp. 404-405]. Note that we assumed it
sufficient to test only between a conditional probability estimate $q_{i|j}$ and the
prevalence $p_j$ of the class upon which it is conditioned, since the class prevalence $p_j$
actually appears in the formulas for $q_{i|j}$, $\forall\, i = 1, \ldots, n$ (see Definition 16, Chapter I).
Under this assumption, nine separate Kendall’s Tau tests were performed after each set of
37 replicates, testing for independence between each of the 9 sets of 37 class-conditional
probabilities and a similar population of actual class prevalences for the class upon which
they are conditioned. We report the mean absolute value of Kendall’s Tau Correlation
Coefficients and the corresponding mean p-values from those tests in a pair of 3 x 3 matrices.
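The independence test described above can be sketched as follows. This is not the routine used for the thesis results: it assumes the tie-adjusted tau-b form of the coefficient and replaces the tabulated p-value with a two-sided permutation p-value, and the 37 data points generated at the end are hypothetical.

```python
import random

def kendall_tau_b(x, y):
    """Kendall's tau-b, which adjusts the denominator for tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counts toward neither
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = ((concordant + discordant + ties_x) *
             (concordant + discordant + ties_y)) ** 0.5
    return 0.0 if denom == 0 else (concordant - discordant) / denom

def permutation_p_value(x, y, permutations=300, seed=0):
    """Two-sided p-value: share of shuffles with |tau| at least as large."""
    rng = random.Random(seed)
    observed = abs(kendall_tau_b(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if abs(kendall_tau_b(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (permutations + 1)

rng = random.Random(1)
prevalence = [rng.uniform(0.1, 0.8) for _ in range(37)]  # hypothetical p_j
estimate = [rng.uniform(0.8, 1.0) for _ in range(37)]    # hypothetical q_{i|j}
print(kendall_tau_b(estimate, prevalence), permutation_p_value(estimate, prevalence))
```

In this sketch, an estimate that is constant across all replicates (for instance, a mistake that is never made) yields a coefficient of exactly zero and a p-value of 1.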

A Monte Carlo simulated power analysis algorithm provided a 99% confidence
interval of (0.8083, 0.8282) for the power of a test with 37 sample points and an
alternative hypothesis that the absolute value of Kendall’s Tau Correlation Coefficient was
as great as 0.4, considering p-values of less than 0.15 statistically significant.

For  both  the  uniform  and  beta  scenarios,  we  trained  a  PNN  classifier  on  the  subsets 
of  the  data  set  derived  by  drawing  the  first  two  class  prevalences  randomly  from  the 
appropriate  bivariate  distribution  until  its  randomly  determined  membership  count  was 
met,  maximizing  the  overall  size  of  the  training  data  set  according  to  the  constraints  of  the 
randomly  determined  prevalences.  We  disallowed  instances  of  zero  class  membership  for 
any  class,  since  Assumption  1  from  Chapter  I  only  applies  to  non-zero  class  prevalences. 
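The conversion from goal prevalences to whole-number class sizes can be sketched as below. The sizing rule is an assumption made for illustration (the most prevalent class uses as many of its 50 available exemplars as possible); the thesis does not state the rule in this form, and the function name is hypothetical.

```python
def class_counts(prevalences, available=50):
    """Turn goal prevalences into whole-number class sizes.

    Assumption: the largest training set is the one in which the most
    prevalent class uses all of its available exemplars; rounding then
    fixes the prevalences actually achieved.
    """
    if min(prevalences) <= 0:
        raise ValueError("zero class membership is disallowed")
    total = available / max(prevalences)  # largest feasible overall size
    counts = [min(available, max(1, round(p * total))) for p in prevalences]
    achieved = [c / sum(counts) for c in counts]
    return counts, achieved

counts, achieved = class_counts([0.5, 0.3, 0.2])
print(counts, achieved)
```

Because exemplars are indivisible, the achieved prevalences generally differ slightly from the goal prevalences drawn.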

We validated the classifiers via the Lachenbruch holdout method, which yields a very
precise estimate of Error, called the “Actual Error Rate” (AER), where
Error = (1 - Accuracy) [1], [17, p. 4]. In this method, the classifier is trained on all but
one exemplar at a time and then that exemplar is classified to populate the contingency
matrix. After the contingency matrix was completely populated in this way by repeating the
Lachenbruch holdout procedure for each exemplar from a randomly chosen set of training
data, the transpose stochastic confusion matrix was formed; then, the entire process listed
above was repeated a total of 37 times, for both the uniform and beta scenarios.
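The holdout procedure above can be sketched in a few lines. This is an illustration under assumptions: the toy nearest-mean classifier stands in for the PNN, the data are made up, and the “transpose stochastic” matrix is read here as a matrix whose columns (actual classes) each sum to 1.

```python
def leave_one_out_confusion(samples, labels, n_classes, fit_predict):
    """Populate a contingency matrix by holding out one exemplar at a time.

    fit_predict(train_x, train_y, x) is any classifier wrapped as a function
    returning a class index; rows are labeled classes, columns actual classes.
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for i, (x, actual) in enumerate(zip(samples, labels)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        counts[fit_predict(train_x, train_y, x)][actual] += 1
    return counts

def column_normalize(counts):
    """Divide each column by its total to estimate P(label i | actual j)."""
    n = len(counts)
    totals = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    return [[counts[i][j] / totals[j] for j in range(n)] for i in range(n)]

def nearest_mean(train_x, train_y, x):
    """Toy one-dimensional stand-in classifier, used only for the demo."""
    best, best_dist = None, float("inf")
    for c in sorted(set(train_y)):
        members = [v for v, y in zip(train_x, train_y) if y == c]
        dist = abs(x - sum(members) / len(members))
        if dist < best_dist:
            best, best_dist = c, dist
    return best

xs = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3]  # made-up scores
ys = [0, 0, 0, 1, 1, 1, 2, 2, 2]
counts = leave_one_out_confusion(xs, ys, 3, nearest_mean)
confusion = column_normalize(counts)
error_rate = 1 - sum(counts[i][i] for i in range(3)) / len(xs)
print(confusion, error_rate)
```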

A  PNN  uses  a  continuous  threshold  parameter  called  the  spread.  This  is  the  common 
standard  deviation  of  the  small  multivariate  normal  probability  density  functions  that  are 
constructed  with  each  training  exemplar  as  the  mean  vector,  then  summed  and  normalized 
to  form  the  “Parzen  Window”  probability  density  functions  for  each  class  during 
training [5, pp. 164-166]. We standardized the data before training and validation, enabling
us to vary the spread parameter for the PNN from 0.001 to 1.001 with confidence of not
needing to go any larger with the spread [1]. If one sought to classify a new exemplar
according to such a classifier, one would need to subtract the grand mean of the training
data and divide by its standard deviation to obtain a standardized form of the exemplar.
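A compact sketch of the Parzen-window decision rule and the standardization step follows. It is illustrative rather than the implementation used here: standardization is assumed to be applied feature by feature, the class densities are compared without any prior weighting, and the four training points are made up.

```python
import math

def standardize(train_x):
    """Return standardized rows plus a function that rescales new exemplars."""
    columns = list(zip(*train_x))
    means = [sum(c) / len(c) for c in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(columns, means)]

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, sds)]

    return [scale(row) for row in train_x], scale

def pnn_classify(train_x, train_y, x, spread):
    """Assign x to the class with the largest Parzen-window density."""
    sums, sizes = {}, {}
    for point, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, x))
        kernel = math.exp(-sq_dist / (2.0 * spread ** 2))
        sums[label] = sums.get(label, 0.0) + kernel
        sizes[label] = sizes.get(label, 0) + 1
    # The Gaussian normalising constant is common to all classes, so only
    # the per-class average of the kernels is needed for the comparison.
    return max(sums, key=lambda c: sums[c] / sizes[c])

train = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.1], [2.2, 1.9]]  # made-up exemplars
labels = [0, 0, 1, 1]
z_train, scale = standardize(train)
first = pnn_classify(z_train, labels, scale([0.1, 0.0]), spread=0.301)
second = pnn_classify(z_train, labels, scale([2.1, 2.0]), spread=0.301)
print(first, second)
```

Note that a new exemplar is passed through the same `scale` function fitted on the training data before it is classified, mirroring the standardization requirement stated above.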

We  performed  a  “spread  study”  by  taking  10  equally-spaced  steps  of  0.1  between 
0.001  and  1.001,  performing  the  37  replications  mentioned  above  at  each  point  for  both 
the  uniform  and  beta  scenarios.  We  found  the  value  of  the  spread  parameter  such  that  a 
classification  system  based  on  that  parameter  minimized  Bayes  risk  under  a  certain 
assumed  cost  regime  and  class  prevalence  distribution.  The  conditional  probability  matrix 
estimate  used  for  a  classification  system  based  on  a  particular  spread  parameter  and  class 
prevalence  distribution  was  the  mean  of  the  37  confusion  matrices  produced  by  the 
experiment  at  that  particular  spread  parameter,  and  for  that  particular  distribution  of  class 
prevalences.  We  calculated  Bayes  risk  under  two  different  fixed-cost  regimes  for  each  of  the 
prevalence  distribution  scenarios,  but  when  the  assumed  distribution  of  prior  probabilities 
was  held  constant  between  risk  calculations  for  different  cost  regimes,  we  used  the  same 
conditional  probability  matrix  estimate  for  both  calculations.  The  two  cost  regimes  used 
are  shown  in  Tables  3  and  4. 

Table 3: Cost Regime 1.

Cost Regime 1        Actual Class: 1   Actual Class: 2   Actual Class: 3
Labeled Class: 1            0                 5                 5
Labeled Class: 2            1                 0                 1
Labeled Class: 3            1                 1                 0

Table 4: Cost Regime 2.

Cost Regime 2        Actual Class: 1   Actual Class: 2   Actual Class: 3
Labeled Class: 1            1                10                 2
Labeled Class: 2            2                 1                 2
Labeled Class: 3            2                10                 1
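The risk comparison across spread values can be sketched as follows. The sketch assumes the usual form of Bayes risk, the sum over labeled class i and actual class j of cost times conditional probability times prevalence; the two conditional probability matrices are hypothetical stand-ins for the 37-replicate means, and only the Cost Regime 1 entries come from Table 3.

```python
def bayes_risk(costs, cond_probs, prevalences):
    """Sum of cost[i][j] * P(label i | actual j) * prevalence[j]."""
    n = len(prevalences)
    return sum(costs[i][j] * cond_probs[i][j] * prevalences[j]
               for i in range(n) for j in range(n))

# Rows are labeled classes, columns are actual classes (Table 3).
COST_REGIME_1 = [[0, 5, 5], [1, 0, 1], [1, 1, 0]]

# Hypothetical column-stochastic estimates for two spread values.
candidates = {
    0.201: [[1.0, 0.00, 0.00], [0.0, 0.90, 0.10], [0.0, 0.10, 0.90]],
    0.301: [[1.0, 0.00, 0.00], [0.0, 0.94, 0.08], [0.0, 0.06, 0.92]],
}
uniform = [1 / 3] * 3  # expected prevalences under the uniform scenario
risks = {s: bayes_risk(COST_REGIME_1, q, uniform) for s, q in candidates.items()}
best_spread = min(risks, key=risks.get)
print(risks, best_spread)
```

Holding the conditional probability estimates fixed and swapping in a different cost matrix reproduces the pattern used in the spread study, where the same estimates serve both cost regimes.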
4.1 Uniform Distribution Scenario


As shown in Figures 9 and 10, spread parameter value $\theta = 0.301$ minimized Bayes
Risk over two separate fixed cost scenarios. Figure 11 shows that this same value of the
spread also minimized the AER.


Figure 9: Bayes Risks for Cost Regime 1, Uniform Application.

Figure 10: Bayes Risks for Cost Regime 2, Uniform Application.

Figure 11: Actual Error Rates, Uniform Application.

As  we  can  see  from  the  mean  p-values  and  absolute  correlations  in  Table  5, 
Assumption  1  from  Chapter  I  appears  to  have  been  violated  in  quite  a  few  cases,  especially 
in  the  lower  right-hand  corner  of  the  table,  corresponding  to  classification  decisions 
involving  the  Iris  Versicolor  and  Iris  Virginica  species.  It  should  be  noted  that  for  the 
uniform  scenario,  lower  p-values  and  higher  correlations  appeared  in  the  areas  of  Table  5 
corresponding  to  these  species,  no  matter  how  we  rearranged  the  order  of  which  species 
were  assigned  to  which  class  numbers.  This  may  be  related  to  the  relative  difficulty  in 
distinguishing  between  these  two  species,  as  illustrated  in  Figure  8  above. 

Finally,  it  is  worth  noting  that  Accuracy-based  analysis,  wherein  the  goal  is  to 
minimize  the  AER,  yielded  no  different  results  than  the  RRF  in  this  case. 


Table 5: Mean Absolute Correlations and p-values, Uniform Application.

Correlations         Actual Class: 1   Actual Class: 2   Actual Class: 3
Labeled Class: 1          0.32              0.32              0.07
Labeled Class: 2          0.29              0.64              0.52
Labeled Class: 3          0.08              0.47              0.57

p-values             Actual Class: 1   Actual Class: 2   Actual Class: 3
Labeled Class: 1          0.20              0.09              0.78
Labeled Class: 2          0.21              0.00              0.09
Labeled Class: 3          0.75              0.09              0.00

4.2 Beta Distribution Scenario


As shown in Figures 12 and 13, spread parameter value $\theta = 0.201$ minimized Bayes
Risk for Cost Regime 1, and a different spread parameter value $\theta = 0.401$ minimized Bayes
Risk for Cost Regime 2. It is interesting to note, as displayed in Figure 14, that yet another
spread parameter value $\theta = 0.301$, which is the mean of the two parameters
minimizing risk under the two cost scenarios, minimized the AER.


Figure 12: Bayes Risks for Cost Regime 1, Beta Application. (Horizontal axis: Spread Parameter Tested.)

Figure 13: Bayes Risks for Cost Regime 2, Beta Application.

Figure 14: Actual Error Rates, Beta Application.


As we can see from the mean p-values and absolute correlations in Table 6,
Assumption 1 from Chapter I appears to hold quite well in this scenario. It is interesting to
note that the lowest correlations and highest p-values occurred in the (1,3) and (3,1)
positions of Table 6. This may be related to the fact that, as shown in Figure 8 above,
Classes 1 and 3, namely, Iris Setosa and Iris Virginica, are difficult to confuse.


Table 6: Mean Absolute Correlations and p-values, Beta Application.

Correlations         Actual Class: 1   Actual Class: 2   Actual Class: 3
Labeled Class: 1          0.09              0.16              0.04
Labeled Class: 2          0.09              0.26              0.19
Labeled Class: 3          0.00              0.19              0.23

p-values             Actual Class: 1   Actual Class: 2   Actual Class: 3
Labeled Class: 1          0.73              0.62              0.91
Labeled Class: 2          0.73              0.15              0.30
Labeled Class: 3          1.00              0.27              0.18

It should be noted here that if the manner of assigning class numbers to the species
varied at all from the arrangement listed at the beginning of this Chapter, certain p-values
began to be low and corresponding correlations high, similar to the case with the uniform
application. This may be related to the fact that we had unequal means for the classes in
the beta scenario, since the marginal probability distribution for the first class chosen
always had mean $\left( \frac{1}{2} \right)$; thus, when a species that was not as easy as Iris Setosa to classify
was chosen as Class 1, it tended to have a larger class prevalence, and thus more influence
over the training process.
V. Conclusion and Research Suggestions


5.1 Summary of Application Results

With no knowledge of class prevalence distributions, persons training classifiers may
wish to assume a uniform distribution. However, as has been shown, Assumption 1 from
Chapter I may tend not to be met in this case, particularly if there are quite a few wrong
decisions being made by the classifier. The Kendall’s Tau test seems to be rather sensitive
to these mistakes, and the fact that a confusion matrix should (in the ideal case) have only
a few non-zero off-diagonal entries seems to create a situation with an excessive number of
ties. The ratio of the number of times a classifier makes a given mistake to the number of
times it does not seems to hold great power over these results; for example, there was one type of
classification mistake that was never made in any case over all of the random trials
performed, and so the correlation value for this element of the confusion matrix and the
prevalence of the actual class over which the element was normalized was always exactly
zero, with p-value 1. However, if just one mistake of a certain type occurred during
classifier validation, this often resulted in a rather high correlation and a rather low p-value
for the independence test for that particular class-conditional probability estimate.

Although Assumption 1 from Chapter I appears to have been
violated, there are many cases in which the basic assumptions of an applied multivariate
analysis technique are seriously violated, and yet the technique based upon these
assumptions is still very useful and informative [1]. Therefore, I would recommend not
eliminating risk-based comparison of classification systems until an investigation into such
matters can be made, or another practical method of calculating risk is found that does not
require such strict independence assumptions (if indeed such a practical method exists).


It  would  also  appear  that  a  more  informative  class  prevalence  distribution  than  the 
uniform  tends  to  yield  better  results  when  testing  Assumption  1  from  Chapter  I.  The 
method  for  training  classifiers,  wherein  training  data  is  randomly  chosen  according  to  an 
assumed  statistical  or  other  distribution,  may  have  applications  for  persons  involved  in  the 
development  of  classification  systems,  because  it  eliminates  human  bias  and  allows  the 
testing  of  said  Assumption. 

Based  on  the  results  of  this  thesis,  I  would  advocate  a  paradigm  shift  toward 
risk-based  comparison  of  classification  systems  in  the  field  of  ROC  analysis,  to  allow  both 
the  users  and  producers  of  classification  systems  to  have  more  confidence  in  the 
performance  of  these  systems. 

5.2  Suggestions  for  Further  Research 

Possible  areas  of  further  research  are: 

1.  The  field  of  belief  functions  may  be  more  accessible  to  end-users  as  a  potential  weighting 
function  on  prior  probabilities,  since  performing  statistical  experiments  may  be  too 
expensive  or  difficult. 

2.  The  framework  of  independence,  if  validated,  leaves  the  door  open  for  others  to  form 
and  test  risk  over  independently  analyzed  distributions  of  costs  and  class-conditional 
probabilities  as  well  as  class  prevalences,  if  indeed  such  distributions  may  be  found. 

3.  The  feasibility  of  calculating  the  ROC  convex  hull  (ROCCH)  as  a  time-saving  method 
for  near-real  time  analysis  of  classification  systems  is  still  in  question. 
4. The possible need to test each class-conditional probability for independence against all
class prevalences, not just the prevalence of the class upon which it is conditioned, remains to be investigated.

5.  A  better  test  statistic  for  independence,  other  than  Kendall’s  Tau  Correlation 
Coefficient,  may  exist. 

6.  An  application  for  designed  experiments  to  aid  in  spread  studies  or  other  such  risk-based 
comparisons. 
Appendix A. Mathematical Proofs

A.1 Conjecture Involving the Binomial Coefficients

Conjecture 1 (Relating to Binomial Coefficients). Given any positive integer $m \le 47$:

\[
\sum_{u=0}^{m} \frac{(-1)^u}{u + 2} \binom{m}{u} (m + 2)(m + 1) = 1 \tag{26}
\]

Proof. By exhaustion, directly calculated for $m \le 47$ using Matlab® (calculation for
integers greater than 47 exceeds machine precision limits, causing unavoidable
computational error). □
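The precision limit mentioned in the proof can be sidestepped with exact rational arithmetic. The sketch below is an illustration, not part of the original calculation: it evaluates the left-hand side of (26) with Python's `fractions` module, and the range checked (m up to 200) is an arbitrary choice. The identity is also consistent with the Beta integral $\int_0^1 x (1 - x)^m \, dx = \frac{1}{(m+1)(m+2)}$, which equals $\sum_{u=0}^{m} \frac{(-1)^u}{u+2} \binom{m}{u}$ after expanding $(1 - x)^m$ and integrating term by term.

```python
from fractions import Fraction
from math import comb

def conjecture_lhs(m):
    """Left-hand side of (26), evaluated in exact rational arithmetic."""
    total = sum(Fraction((-1) ** u * comb(m, u), u + 2) for u in range(m + 1))
    return total * (m + 2) * (m + 1)

# Exact arithmetic has no rounding error, so the check can run past m = 47.
holds = all(conjecture_lhs(m) == 1 for m in range(0, 201))
print(holds)
```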


A.2 Integrals Involving the Standard n-Simplex

Theorem 2 (Volume Under Standard Simplex). Given any positive integer $n$, along with
any finite sequence $\{x_i\}_{i=1}^{n}$ of real variables, the multiplicative inverse of the integral of the
identity function over the standard n-simplex, as in Definition 23, Chapter I, is simply $n!$:

\[
\frac{1}{\displaystyle \int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} dx_n \cdots dx_3 \, dx_2 \, dx_1} = n! \tag{27}
\]

Proof. To prove the desired result, we shall prove its equivalent: that the value of the
denominator on the left-hand side of (27) is $\left( \frac{1}{n!} \right)$. We begin by performing the first three
integrations indicated, working from the inside out, to determine if there is a consistent
pattern:
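\[
\begin{aligned}
&\int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} dx_n \cdots dx_3 \, dx_2 \, dx_1 \\
&\quad = \int_0^1 \int_0^{1-x_1} \cdots \int_0^{1-\sum_{i=1}^{n-2} x_i} \Big[ x_n \Big]_{x_n=0}^{1-\sum_{i=1}^{n-1} x_i} dx_{n-1} \cdots dx_2 \, dx_1 \\
&\quad \overset{*}{=} \int_0^1 \int_0^{1-x_1} \cdots \int_0^{1-\sum_{i=1}^{n-2} x_i} \left( 1 - \sum_{i=1}^{n-1} [x_i] \right) dx_{n-1} \cdots dx_2 \, dx_1 \\
&\quad = \int_0^1 \int_0^{1-x_1} \cdots \int_0^{1-\sum_{i=1}^{n-2} x_i} \left( 1 - \sum_{i=1}^{n-2} [x_i] - x_{n-1} \right) dx_{n-1} \cdots dx_2 \, dx_1 \\
&\quad = \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{n-3} x_i} \left[ -\frac{1}{2} \left( 1 - \sum_{i=1}^{n-2} [x_i] - x_{n-1} \right)^{2} \right]_{x_{n-1}=0}^{1-\sum_{i=1}^{n-2} x_i} dx_{n-2} \cdots dx_2 \, dx_1 \\
&\quad \overset{*}{=} \frac{1}{2} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{n-3} x_i} \left( 1 - \sum_{i=1}^{n-2} [x_i] \right)^{2} dx_{n-2} \cdots dx_2 \, dx_1 \\
&\quad = \frac{1}{2} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{n-3} x_i} \left( 1 - \sum_{i=1}^{n-3} [x_i] - x_{n-2} \right)^{2} dx_{n-2} \cdots dx_2 \, dx_1 \\
&\quad = \frac{1}{2} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{n-4} x_i} \left[ -\frac{1}{3} \left( 1 - \sum_{i=1}^{n-3} [x_i] - x_{n-2} \right)^{3} \right]_{x_{n-2}=0}^{1-\sum_{i=1}^{n-3} x_i} dx_{n-3} \cdots dx_1 \\
&\quad \overset{*}{=} \frac{1}{3!} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{n-4} x_i} \left( 1 - \sum_{i=1}^{n-3} [x_i] \right)^{3} dx_{n-3} \cdots dx_2 \, dx_1
\end{aligned} \tag{28}
\]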


A consistent, predictable pattern is now recognizable on the lines denoted by an
asterisk (*). When the largest remaining variable index is $(n - k)$, there will be a constant
$\left( \frac{1}{k!} \right)$ in front of the integral signs. Also, the resulting integrand will simply be the quantity
$\left( 1 - \sum_{i=1}^{n-k} [x_i] \right)^{k}$, and the upper limit of integration for the innermost integral will be the
quantity $\left( 1 - \sum_{i=1}^{n-(k+1)} [x_i] \right)$. We may now proceed to complete the calculation.

Performing  repeated  integration  in  this  manner  until  the  largest  remaining  variable 
index  is  (n  —  [n  —  1]  =  1),  we  obtain  the  desired  result: 
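\[
\begin{aligned}
&\int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} dx_n \cdots dx_3 \, dx_2 \, dx_1 \\
&\quad = \frac{1}{(n - 1)!} \int_0^{1-\sum_{i=1}^{n-[(n-1)+1]} x_i} \left( 1 - \sum_{i=1}^{n-[n-1]} [x_i] \right)^{n-1} dx_{n-[n-1]} \\
&\quad = \frac{1}{(n - 1)!} \int_0^1 (1 - x_1)^{n-1} \, dx_1 \\
&\quad = \frac{1}{(n - 1)!} \left[ -\frac{1}{n} (1 - x_1)^{n} \right]_{x_1=0}^{1} \\
&\quad = \frac{1}{n!}
\end{aligned} \tag{29}
\]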


□ 
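Theorem 2 lends itself to a quick numerical sanity check. The sketch below is illustrative only and assumes a simple Monte Carlo scheme (not used in the proof): points are drawn uniformly in the unit n-cube, and the fraction whose coordinates sum to at most 1 estimates the volume under the standard simplex.

```python
import math
import random

def simplex_volume_estimate(n, draws=100000, seed=0):
    """Fraction of uniform points in the unit n-cube with coordinate sum <= 1."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(draws)
                 if sum(rng.random() for _ in range(n)) <= 1.0)
    return inside / draws

for n in (2, 3, 4):
    # The estimate should sit close to 1/n! for each n.
    print(n, simplex_volume_estimate(n), 1 / math.factorial(n))
```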


Corollary 1 (Integral of an Axis Variable Over a Standard Simplex). Given any finite
sequence $\{x_i\}_{i=1}^{n}$ of axis variables for a standard n-simplex $\Delta_n$; if
$\sum_{u=0}^{m} \frac{(-1)^u}{u + 2} \binom{m}{u} (m + 2)(m + 1) = 1$ holds true for all integers $m < n$, then the
multiplicative inverse of the integral over $\Delta_n$ of any one of the axis variables $x_j \in \{x_i\}_{i=1}^{n}$
is $(n + 1)!$;

\[
\frac{1}{\displaystyle \int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} x_j \, dx_n \cdots dx_3 \, dx_2 \, dx_1} = (n + 1)!, \quad \forall\, j = 1, 2, 3, \ldots, n \tag{30}
\]

Proof. To prove the desired result, we shall prove its equivalent: that the value of the
denominator on the left-hand side of (30) is $\left( \frac{1}{(n+1)!} \right)$. Without loss of generality, consider
the case where $j = k$, for some fixed $k = 1, 2, 3, \ldots, n$:
\[
\begin{aligned}
&\int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} [x_k] \, dx_n \cdots dx_3 \, dx_2 \, dx_1 \\
&\quad = \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{k-1} x_i} \int_0^{1-\sum_{i=1}^{k} x_i} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} [x_k] \, dx_n \cdots dx_{k+1} \, dx_k \cdots dx_1 \\
&\quad = \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{k-1} x_i} [x_k] \left[ \int_0^{1-\sum_{i=1}^{k} x_i} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} dx_n \cdots dx_{k+1} \right] dx_k \cdots dx_1
\end{aligned} \tag{31}
\]


We know from equation patterns in the proof of Theorem 2 above that the term in
brackets on the last line of (31) is the integrand of the identity function integrated over $\Delta_n$
after $(n - k)$ integrals have been performed, working from the inside out, and that this
integrand is simply the quantity $\frac{1}{[n-k]!} \left( 1 - \sum_{i=1}^{k} (x_i) \right)^{n-k}$; therefore, we may write:


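\[
\begin{aligned}
&\int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} [x_k] \, dx_n \cdots dx_3 \, dx_2 \, dx_1 \\
&\quad = \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{k-1} x_i} [x_k] \left[ \int_0^{1-\sum_{i=1}^{k} x_i} \cdots \int_0^{1-\sum_{i=1}^{n-1} x_i} dx_n \cdots dx_{k+1} \right] dx_k \cdots dx_1 \\
&\quad = \frac{1}{(n - k)!} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{k-1} x_i} [x_k] \left( 1 - \sum_{i=1}^{k} (x_i) \right)^{n-k} dx_k \cdots dx_1 \\
&\quad = \frac{1}{(n - k)!} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{k-1} x_i} [x_k] \left( 1 - \sum_{i=1}^{k-1} (x_i) - x_k \right)^{n-k} dx_k \cdots dx_1 \\
&\quad = \frac{1}{(n - k)!} \int_0^1 \cdots \int_0^{1-\sum_{i=1}^{k-1} x_i} [x_k] \left( \left[ 1 - \sum_{i=1}^{k-1} (x_i) \right] + [-x_k] \right)^{n-k} dx_k \cdots dx_1
\end{aligned} \tag{32}
\]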
The Binomial Theorem states that $(a + b)^t = \sum_{u=0}^{t} \binom{t}{u} a^{t-u} b^{u}$, where
$\binom{t}{u} = \frac{t!}{(t-u)! \, u!}$. Applying this to the term in parentheses in (32), we may write:

\[
\left( \left[ 1 - \sum_{i=1}^{k-1} (x_i) \right] + [-x_k] \right)^{n-k} = \sum_{u=0}^{n-k} [-1]^{u} \binom{n-k}{u} \left[ 1 - \sum_{i=1}^{k-1} (x_i) \right]^{(n-k)-u} [x_k]^{u}
\]


Since  we  have  arbitrarily  fixed  k,  let  us  temporarily  denote  m  =  n  —  k  to  ease 
notational  burdens.  This  enables  us  to  rewrite  (32)  above  as: 
(33)


where  we  have  interchanged  integration  with  finite  summation.  It  now  befalls  us  to 
evaluate  the  term  in  brackets  in  (33)  above.  We  will  perform  one  integration  first: 
Multiplying the bracketed term in (33) by the quantity (u + 2), we may now write:


r1  /-i-Eti1  (xi)  /  EE  .  ,, 
and if we additionally multiply the bracketed term in (33) by (m + 3), we may then write:

A consistent, predictable pattern is now recognizable on the lines denoted by an
asterisk (*) in (34) and (35) above. When the largest remaining variable index is (k - j),
there will appear a constant in front of the integral signs in the bracketed term in (33):
Also, the resulting integrand will simply be the quantity

upper limit of integration for the innermost integral will be the quantity
$\left( 1 - \sum_{i=1}^{k-[j+1]} (x_i) \right)$. We may now proceed to complete the calculation.


Performing  repeated  integration  in  this  manner  until  the  largest  remaining  variable 
index is (k - [k - 2] = 2), if we multiply the bracketed term in (33) by the quantity

i.e., the bracketed term in (33) may be simply written as:

(36)


Substituting the result of (36) for the bracketed term in (33), the expression we wish
to prove equal to $\frac{1}{(n+1)!}$ for the case where $j = k$ may now be written as:

where the last step may be taken due to the fact that $m = n - k$ for arbitrary
$k = 1, \ldots, n$, hence $m < n$, and the result follows from the hypothesis.

□ 
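As a compact cross-check of how the proof of Corollary 1 closes, the worked equation below evaluates the bracketed term directly. It is a sketch under an assumption: from (32), the binomial expansion, and the multipliers $(u + 2)$ and $(m + 3)$ quoted above, the bracketed term is taken to be the iterated integral of $x_k^{u+1} \left( 1 - \sum_{i=1}^{k-1} x_i \right)^{m-u}$. The shorthand $A_r$ and $B_u$ is introduced here for brevity and is not notation from the body of the thesis.

```latex
% Shorthand (assumed): A_r = 1 - \sum_{i=1}^{r} x_i, with A_0 = 1, and m = n - k.
\begin{align*}
B_u &= \int_0^{1} \int_0^{A_1} \cdots \int_0^{A_{k-1}}
       x_k^{\,u+1} \, A_{k-1}^{\,m-u} \; dx_k \cdots dx_1 \\
    &= \frac{1}{u+2} \int_0^{1} \cdots \int_0^{A_{k-2}}
       A_{k-1}^{\,m+2} \; dx_{k-1} \cdots dx_1 \\
    &= \frac{1}{(u+2)(m+3)(m+4) \cdots (m+k+1)}
     = \frac{(m+2)!}{(u+2) \, (n+1)!}, \\[4pt]
\frac{1}{m!} \sum_{u=0}^{m} (-1)^u \binom{m}{u} B_u
    &= \frac{1}{(n+1)!} \sum_{u=0}^{m} \frac{(-1)^u}{u+2} \binom{m}{u} (m+2)(m+1)
     = \frac{1}{(n+1)!}.
\end{align*}
```

The last equality uses (26) with $m = n - k$; for $k = 1$ the product $(m+3) \cdots (m+k+1)$ is empty and $B_u$ reduces to $\frac{1}{u+2}$.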